Musing on Intelligence – Part 7: Learning and Memory

This is Part 7 of a multi-part series in which I explore various topics on intelligence, bringing together ideas from biology, brain studies, psychology, computer science and engineering. The last section of this post contains links to other posts in this series. In the previous post I described a primal architecture for basic survival, but I didn’t provide many implementation details. In this post I begin that process. Some preliminaries are required first, so I start with learning and memory, including the concepts of resonance, self-awareness and self-consistency.

Learning as Association

Learning is a critical part of life regulation and survival: it gives organisms the ability to adapt to ever-changing environments. It has long been studied by psychologists, neuroscientists and computer scientists. In this blog post, I define learning as the process of semi-permanent change in how an organism responds to stimuli. At the lowest level, learning is about associating stimuli. Hebbian learning, a well-known theory proposed by Donald Hebb, states that the synapse between two neurons is strengthened when both neurons are active at the same time. In other words, synaptic signal conductivity is modulated by simultaneous pre- and post-synaptic neuronal activation.

Pavlov’s dog experiment

Association can be conditioned. Remember Pavlov’s famous dog experiment illustrating classical conditioning? When food is presented to the dog, the dog salivates. Food is the unconditioned stimulus (UCS), and salivation is the response (R). Pavlov found that if a bell is rung as the food is presented, then with enough repetition the dog will salivate when the bell is rung, even without the food. In this scenario, the bell is the conditioned stimulus (CS). This type of associative learning has immediate survival value: an organism with this capability can anticipate events and be prepared for them.

A Simple Associative Network

Classical conditioning can be emulated using a simple network with Hebbian learning. Consider the circuit below:

Classical conditioning demonstrated by a simple Hebbian neural network

Initially, the unconditioned stimulus (UCS) elicits the unconditioned response (UCR). Later, the conditioned stimulus (CS) is presented together with the UCS. With CS and UCR both present, Hebbian learning strengthens their association such that even when the UCS is later removed, the CS alone can trigger the UCR. In Pavlov’s experiment, the UCS is the food, the UCR is salivation, and the CS is the bell ring. Let’s rephrase the description above with these substitutions: Initially, food elicits salivation. Later, the bell ring is presented along with the food. With bell ring and salivation both present, Hebbian learning strengthens their association such that even when the food is later removed, the bell ring alone can trigger salivation.
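
To make the mechanism concrete, here is a minimal sketch of the conditioning circuit as a single threshold unit with two input weights. The function names, the threshold of 0.5 and the learning rate of 0.2 are my own illustrative assumptions, not values taken from any experiment.

```python
def respond(w_ucs, w_cs, ucs, cs, threshold=0.5):
    """Return 1 (salivate) if the weighted input exceeds the threshold."""
    return 1 if w_ucs * ucs + w_cs * cs > threshold else 0


def condition(trials, w_ucs=1.0, w_cs=0.0, rate=0.2):
    """Apply a plain Hebbian update to the CS weight over (ucs, cs) trials."""
    for ucs, cs in trials:
        r = respond(w_ucs, w_cs, ucs, cs)
        # Hebbian rule: strengthen only when CS and response are both active.
        w_cs += rate * cs * r
    return w_cs


# Bell (CS) paired with food (UCS) five times; values are hypothetical.
w_after = condition([(1, 1)] * 5)
print("bell alone before:", respond(1.0, 0.0, ucs=0, cs=1))
print("bell alone after: ", respond(1.0, w_after, ucs=0, cs=1))
```

Note that presenting the bell alone from the start never changes the weight in this sketch, because the response unit stays silent and the Hebbian product is zero.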

The above mechanism, I would suggest, is the simplest form of planning. If a bell rings around the time food is presented, then after conditioning the bell ring can lead the dog to expect the food and prepare to receive it by running to the door and waiting. It’s also worth noting that while the bell ring can trigger salivation after conditioning, it cannot satisfy hunger, and therefore its association with salivation will eventually fade if it is not accompanied by food. In general, associations with stimuli that do not lead to homeostatic satisfaction will fade.
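
The fading of an unrewarded association can be sketched by adding a decay term to the weight update. This is only one of many possible ways to model it; the multiplicative decay and the `gain` and `fade` values below are assumptions chosen for illustration.

```python
def update_association(w_cs, bell, food, gain=0.2, fade=0.3):
    """Strengthen the bell weight when food follows; let it decay otherwise."""
    if not bell:
        return w_cs                      # no bell, nothing to associate
    if food:
        return min(1.0, w_cs + gain)     # bell led to satisfaction: reinforce
    return w_cs * (1.0 - fade)           # bell without food: association fades


def run_trials(w_cs, trials):
    """Run a list of (bell, food) trials and return the final weight."""
    for bell, food in trials:
        w_cs = update_association(w_cs, bell, food)
    return w_cs


w_trained = run_trials(0.0, [(1, 1)] * 5)        # bell paired with food
w_faded = run_trials(w_trained, [(1, 0)] * 20)   # bell alone, repeatedly
print(w_trained, w_faded)
```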

How does one implement associative learning in robots? There are numerous approaches, including various forms of artificial neural networks, expert systems, and traditional programming. Philosophically, I believe biological systems offer valuable insights into how learning can be implemented, but AI does not have to be biomimetic; rather, the focus should be on understanding the principles revealed by biological systems.

Implementing Learning

Some of the most popular learning algorithms today use supervised learning, where stimuli and their desired outputs are presented to the system in batches and the connections are modified iteratively until the training error converges to a global or local minimum. The operation is not unlike curve fitting, where a model of the transfer function is approximated. In neural networks, the transfer function is defined by the network topology and the model parameters learned from training.
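
The curve-fitting analogy can be shown with the smallest possible "network": a line with two parameters fitted by batch gradient descent on the mean squared error. The data, learning rate and epoch count are made up for this sketch.

```python
def fit_line(xs, ys, rate=0.05, epochs=2000):
    """Fit y = a*x + b by batch gradient descent on mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The "teacher" supplies ys, so the error can be computed per sample.
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        a -= rate * grad_a
        b -= rate * grad_b
    return a, b


# Hypothetical training batch generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

Here the "topology" is fixed (a straight line) and training only adjusts the parameters `a` and `b`, which is the same division of labor as in a neural network.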

Hubel & Wiesel’s cat experiment

Often the raw inputs are not used directly for training. Instead, salient features are first extracted to reduce the dimensionality of the input. It has been shown that biological systems, such as the cat’s visual system, have hardwired feature detectors. For object recognition, features can be edges, blobs, or more complex features such as SIFT. The biological equivalents of features are receptive fields. Neighboring receptive fields project to the next layer of cells as a “voting block”. Below is an example of a neighborhood of collinear receptive fields in the LGN projecting their outputs to a V1 cell. Each receptive field has an on-center, off-surround response. Individually, each would respond to a point, but combined they respond to a line. Note that the V1 cell has its greatest response when all three receptive fields are active. A line or line segment activates all three receptive fields, which in turn activate the V1 cell, making it an edge detector.
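
The voting-block idea can be sketched on a tiny binary image. The 3x3 receptive field, the surround weight of 1/8 and the V1 threshold of 2.0 are assumptions picked so that the toy example works, not physiological values.

```python
def lgn_response(image, row, col):
    """On-center, off-surround response at (row, col), rectified at zero."""
    center = image[row][col]
    surround = sum(
        image[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if (r, c) != (row, col)
    )
    return max(0.0, center - surround / 8.0)


def v1_response(image, centers, threshold=2.0):
    """Fire (1) when the summed input from collinear fields exceeds threshold."""
    drive = sum(lgn_response(image, r, c) for r, c in centers)
    return 1 if drive > threshold else 0


# Three vertically aligned receptive fields feeding one V1 cell.
centers = [(1, 2), (2, 2), (3, 2)]
vertical = [[1 if c == 2 else 0 for c in range(5)] for r in range(5)]
horizontal = [[1 if r == 2 else 0 for c in range(5)] for r in range(5)]
point = [[1 if (r, c) == (2, 2) else 0 for c in range(5)] for r in range(5)]

for name, img in [("vertical", vertical), ("horizontal", horizontal), ("point", point)]:
    print(name, v1_response(img, centers))
```

In this sketch only the vertical line drives all three fields at once, so the V1 cell is selective for that orientation.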

On-center, off-surround receptive fields project to a common V1 cell making it an edge detector.

A convolutional neural network (CNN), often used in deep learning, has feature detectors as its first layer. These detectors, equivalent to receptive fields in biological vision, are implemented by convolving a small kernel with the input image, resulting in a heat map for that feature. The heat maps are then subsampled to reduce dimensionality and improve generalization, a process commonly referred to as “pooling”. Finally, the pooled output feeds a multi-layer back-propagation network whose learning is based on gradient descent. Deep learning networks are close relatives of back-propagation networks, with the CNN adding convolution preprocessing and pooling. Below is a CNN example with two convolution-subsampling layers. The final fully connected layers basically form a back-propagation network.
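
The convolution and pooling steps are simple enough to write out by hand. The sketch below uses plain Python lists and a hand-picked vertical-edge kernel; in a real CNN the kernels are learned and a numerical library would be used, so treat the names and sizes here as illustrative assumptions.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and return the heat map.

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Subsample by taking the maximum over non-overlapping size x size windows."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        pooled.append(row)
    return pooled


# 6x6 image: dark left half, bright right half (a vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
edge_kernel = [[-1, 1], [-1, 1]]   # responds to dark-to-bright transitions

heat_map = convolve2d(image, edge_kernel)
pooled = max_pool(heat_map)
print(pooled)
```

The pooled map is smaller than the heat map but still records roughly where the edge is, which is the dimensionality reduction described above.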

Typical convolutional neural network (CNN)

Short-Term vs. Long-Term Memory

Learning requires memory so that adaptation can persist. Memory can both remember and forget. Forgetting frees memory to store more recent and salient stimuli. Memory can be short-term (STM) or long-term (LTM). In biological systems, STM corresponds to the electrochemical potential of the cell body, and LTM to the strength of synaptic connections. In artificial neural networks, STM is modeled by the feedforward activations, and LTM is modeled by Hebbian learning or by optimization algorithms such as steepest descent that ultimately modify the synaptic weights.

Declarative vs. Procedural Memory

Memory is classified according to what it records. Psychologists define two types of memory: declarative memory and procedural memory. Declarative memory remembers the whats, such as facts, data and events. The deep learning network described earlier is an example of declarative memory that recalls shapes or objects. Procedural memory remembers the hows, such as dribbling a ball, swinging a bat, or speaking. Procedural memory differs from declarative memory in that it encodes and recalls spatiotemporal sequences. Procedural memory can be emulated using Markov chains, such as the Hidden Markov Model used in speech recognition, avalanche networks, and various forms of recurrent networks such as LSTM.
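
As a small illustration of sequence memory, here is a first-order Markov chain that learns transitions from demonstrated action sequences and then replays the most likely continuation. This is a plain, fully observed Markov chain rather than a Hidden Markov Model, and the action names are invented for the example.

```python
from collections import Counter, defaultdict


def learn_transitions(sequences):
    """Count how often each action is followed by each other action."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return counts


def most_likely_next(counts, state):
    """Return the most frequent successor of state, or None if unseen."""
    if state not in counts:
        return None
    return counts[state].most_common(1)[0][0]


def replay(counts, start, max_steps):
    """Recall a sequence by repeatedly following the most likely transition."""
    sequence = [start]
    for _ in range(max_steps):
        nxt = most_likely_next(counts, sequence[-1])
        if nxt is None:
            break
        sequence.append(nxt)
    return sequence


# Hypothetical demonstrations of a motor skill.
demos = [
    ["reach", "grasp", "lift", "swing"],
    ["reach", "grasp", "lift", "swing"],
    ["reach", "grasp", "drop"],
]
counts = learn_transitions(demos)
print(replay(counts, "reach", max_steps=5))
```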

Hidden Markov Model

The human brain stores declarative knowledge and procedural knowledge in separate regions. Scientists have learned that the temporal lobe embodies declarative knowledge, while the prefrontal cortex (PFC) and motor cortex embody procedural knowledge. The temporal lobe is capable of audio comprehension and visual recognition, and the PFC is capable of initiating movements while the motor cortex maintains the action. The way procedural knowledge initiates and expresses itself suggests that the PFC operates at a higher level of abstraction (e.g., symbolic) and is the planning center. Separating declarative and procedural memory is quite logical: reaching for and grasping an object should not require one to first recognize the object.

Perfect Recall from Imperfect Stimuli

Biological memory is capable of recalling a learned pattern in the face of imperfect input. Similarly, Content Addressable Memory (CAM) in modern computer systems and the Hamming classifier can recall pre-stored content when given content that approximates the original. The ability to recall previously trained data from “close-enough” input is a form of generalization, an important attribute of intelligent systems. In the real world, perfect input is rare; therefore such generalization is a desirable feature. Content addressable memories generally have inefficient or “sparse” storage, in that the number of patterns that can be stored and recalled is a small fraction of the total size of the memory. Storage efficiency is, in essence, traded for generalization.
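
Recall from imperfect input can be sketched as a nearest-neighbor lookup under Hamming distance. This is a software stand-in for the idea, not a model of CAM hardware, and the stored bit patterns are arbitrary examples.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def recall(stored, probe):
    """Return the stored pattern closest to the (possibly corrupted) probe."""
    return min(stored, key=lambda pattern: hamming_distance(pattern, probe))


# Three stored 8-bit patterns; only 3 of the 256 possible patterns are used,
# which is the sparseness that buys tolerance to noise.
stored = ["11110000", "00001111", "10101010"]
noisy = "11010000"   # first pattern with one bit flipped
print(recall(stored, noisy))
```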

Gradient descent tends to converge to a local minimum.

Learned data points can be thought of as local minima on an error surface. Students of back-propagation networks will know that the gradient descent algorithm usually converges to a suboptimal local minimum. The algorithm updates the weights, shaping the error surface such that the training points correspond to local minima. The algorithm requires a “teacher” to provide the “answers” during training, so that the error can be computed and back-propagated through the layers using appropriate credit assignment. Gradient descent learning can be viewed as a type of RMS error feedback in which the error shapes the local minima. If training is done sequentially, one training point at a time rather than in batches, the error surface will not be shaped uniformly; it will favor some training points over others depending on the order of presentation.
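
The dependence on the starting point is easy to see on a one-dimensional surface with two valleys. The function below is an arbitrary example I chose because it has one deep and one shallow minimum; it is not an actual network error surface.

```python
def f(x):
    """Toy error surface with a deep valley (x < 0) and a shallow one (x > 0)."""
    return x ** 4 - 3 * x ** 2 + x


def grad(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(x, rate=0.01, steps=1000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= rate * grad(x)
    return x


x_left = descend(-2.0)    # starts on the left slope
x_right = descend(2.0)    # starts on the right slope
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs stop where the gradient vanishes, but the run started on the right settles in the shallower valley and never finds the deeper one.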

Feedback and Resonance as Cognition

It is well known that feedback brings stability to dynamic systems, and that stable states are equilibria. If a system is stable, it will eventually settle at an equilibrium after a small perturbation in its input or in the system itself. Similarly, feedback in intelligent systems, especially connectionist models, brings stability. But what does stability mean in this context? I submit that stability is a state of recognition or self-consistency. Take pattern recognition, for example: how is an input pattern recognized? It is recognized when the input matches a previously learned pattern. This implies the existence of feedback: the forward path presents the input pattern, and the feedback path recalls a previously learned pattern and compares it with the input. If the two patterns are close enough, the forward and feedback paths become self-consistent and the system is in resonance. If the patterns do not match, another learned pattern is recalled and the cycle continues until a match is found or all learned patterns have been exhausted.

Adaptive Resonance Theory (ART) network developed by S. Grossberg.

The Adaptive Resonance Theory (ART), developed by Stephen Grossberg, implements such a system. The model has an input, an STM layer (F1) that captures the input, and a symbol layer (F2). The pattern captured in F1 excites the winning node in F2; the winning node in turn recalls its associated pattern, which is then compared with the pattern in F1. If the forward pattern (F1->F2) matches the recalled pattern (F2->F1), resonance occurs and the pattern is recognized.
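
The search-and-compare cycle can be sketched for binary patterns as follows. This is a heavily simplified, ART-inspired loop and not Grossberg’s actual equations; the overlap-based ordering, the `vigilance` threshold of 0.75 and the prototypes are all assumptions for illustration.

```python
def overlap(pattern, prototype):
    """Number of active bits shared by the input and a stored prototype."""
    return sum(p & q for p, q in zip(pattern, prototype))


def match_ratio(pattern, prototype):
    """Fraction of the input's active bits confirmed by the recalled prototype."""
    active = sum(pattern)
    return overlap(pattern, prototype) / active if active else 0.0


def recognize(pattern, prototypes, vigilance=0.75):
    """Return the index of the first resonating prototype, or None."""
    # Forward path: try the most strongly excited F2 candidates first.
    order = sorted(range(len(prototypes)),
                   key=lambda k: overlap(pattern, prototypes[k]),
                   reverse=True)
    for k in order:
        # Feedback path: compare the recalled prototype with the input.
        if match_ratio(pattern, prototypes[k]) >= vigilance:
            return k        # close enough: resonance
        # Otherwise this candidate is rejected and the search continues.
    return None             # all learned patterns exhausted


prototypes = [[1, 1, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1]]
print(recognize([1, 1, 1, 0, 0, 0, 0, 0], prototypes))
print(recognize([1, 1, 0, 0, 1, 1, 0, 0], prototypes))
```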

Resonance in Self-Awareness and Self-Consistency

It seems plausible that resonance may be the basis of self-awareness. For instance, how would a person know whether a body part belongs to him? The answer is that the body part responds to his command. In this scenario, there is a resonance between the top-down motor command and the bottom-up perceived response. Through self-exploration, elements in the environment that respond to self-directed commands become part of the self, and a domain of self emerges as a result. This notion has correlates in childhood sensorimotor development. Jean Piaget observed that a child’s babbling can lead to command of language, and that a child’s random waving of arms develops eye-hand coordination. This is sometimes termed “circular reaction learning”. Indeed, recent research, including my own master’s thesis, has demonstrated that robots can learn eye-hand coordination through arm babbling, and so no longer depend on hard-coded inverse kinematics that cannot adapt to physical changes.
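
To give a flavor of arm babbling, here is a toy sketch for a simulated two-link planar arm: the arm issues random joint commands, remembers where the hand was seen, and later reaches by recalling the command whose outcome was closest to the target. This is not the method from my thesis or from any particular paper; the link lengths, trial count and nearest-neighbor recall are assumptions made to keep the example short.

```python
import math
import random

LINK_1, LINK_2 = 1.0, 1.0   # hypothetical link lengths


def hand_position(shoulder, elbow):
    """Where the 'eye' sees the hand for a given pair of joint angles."""
    x = LINK_1 * math.cos(shoulder) + LINK_2 * math.cos(shoulder + elbow)
    y = LINK_1 * math.sin(shoulder) + LINK_2 * math.sin(shoulder + elbow)
    return x, y


def babble(n_trials, seed=0):
    """Issue random motor commands and remember (seen position, command)."""
    rng = random.Random(seed)
    memory = []
    for _ in range(n_trials):
        command = (rng.uniform(-math.pi, math.pi),
                   rng.uniform(-math.pi, math.pi))
        memory.append((hand_position(*command), command))
    return memory


def reach(memory, target):
    """Recall the command whose observed outcome was closest to the target."""
    def distance(item):
        (x, y), _ = item
        return math.hypot(x - target[0], y - target[1])
    return min(memory, key=distance)[1]


memory = babble(2000)
target = (1.0, 1.0)
command = reach(memory, target)
x, y = hand_position(*command)
print("reach error:", math.hypot(x - target[0], y - target[1]))
```

No inverse kinematics is coded anywhere; if the link lengths change, babbling again rebuilds the mapping.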

Key Takeaways

In this blog post, I discussed learning and memory primarily from a connectionist perspective, and tied various artificial neural networks to related principles and observations found in biology. I also discussed the concept of resonance and argued that it is the basis for self-awareness and for learning by self-consistency. As one designs intelligent robots, these neural networks become part of the building blocks, and the principles discovered become a guide. In the next blog post I will explore where and how expert systems and symbolic processing may be applied and combined with connectionist processing.

Further Readings

(The above article is solely the expressed opinion of the author and does not necessarily reflect the position of his current and past associations)