History and Principles of Neural Networks From 1960 to 1990
by David D. Olmsted (Copyright - 1998, 1999, 2006. Free to use for personal and
educational purposes)
Last Revised August 27, 2006
The Classic Perceptrons (1962)
Figure 1
Classic Perceptron: Normalized Inputs
In 1962 Frank Rosenblatt published a book which
combined the concepts of his original perceptron with those of ADALINE to
come up with the classic perceptron design shown in figure 1. In contrast
to ADALINE, perceptrons are based on repulsive learning in which only the
weights on the non-active lines are changed in response to an error. In other
words the weights change only in response to a misclassification. Thus the
weight values are not pulled towards some defined goal but are pushed away
from non-goals. Consequently each subcircuit can represent a whole class of
patterns.
The adaptive multiplication factors (weights) are now placed before
the summation node, as in ADALINE, instead of after the node as in the original
perceptron. In addition, all convergent subcircuits now share a common set of inputs
instead of having randomly connected inputs (although the initial values of the
weights may be randomized, which would effectively accomplish the same thing).
These changes allowed the input pattern to dispense with the binary line signal
requirement in favor of analog signals which could represent
the frequency of an action potential pulse or the ionic charge on a neuron.
Yet, in order for patterns to be reliably discriminated by perceptrons the
pattern inputs had to be normalized, that is the numbers in each pattern had
to add up to the same value - usually one. Using analog values (and thus analog
equations) also required that the binary threshold be replaced with a subtractive
threshold.
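A minimal sketch of the two requirements just described, input normalization and a subtractive threshold, is given below in Python. The function names, the example weights, and the choice to floor the output at zero are assumptions made for illustration; they are not taken from Rosenblatt's book.

```python
def normalize(pattern, total=1.0):
    """Scale a pattern so that its values add up to `total` (usually one)."""
    s = sum(pattern)
    if s == 0:
        raise ValueError("cannot normalize an all-zero pattern")
    return [total * x / s for x in pattern]


def subcircuit_output(pattern, weights, threshold):
    """Weighted sum followed by a subtractive threshold.

    Assumption: values below the threshold produce no output (zero).
    """
    weighted_sum = sum(w * x for w, x in zip(weights, pattern))
    return max(0.0, weighted_sum - threshold)


# Hypothetical analog pattern and weights, chosen only for the example.
pattern = normalize([3.0, 1.0])
output = subcircuit_output(pattern, weights=[2.0, 0.0], threshold=1.0)
```

Here the raw pattern 3, 1 becomes 0.75, 0.25 after normalization, and the subcircuit passes on only the part of its weighted sum that exceeds the threshold.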
Figure 2
Classic Perceptron: Non-Normalized Inputs
Figure 2 shows the effect of non-normalized input patterns. The values of pattern one
add up to 1.0 yet the values of pattern two add up to 1.2. No combination of weight
values or threshold values will allow each of these patterns to have its
own unique convergent subcircuit output. However,
if an additional weight is placed after the summation operation then classification
of non-normalized patterns is possible. Yet this does not seem to have been
done, because manipulating the post-summation node value is not easily incorporated
(mathematically) into the learning procedures used to find the pre-summation
weight values.
As was seen with ADALINE, changing convergent subcircuit
weights only shifts the angle and height of the equal value lines, but since the perceptron
uses repulsive learning that equal value line
now becomes the basis for defining the border between pattern classes. The
equal value graph for the figure 1 example is shown in figure 3. The axes
of the graph list the values of the pattern input lines, which will only produce
an output from the subcircuit's threshold if they are above or to the right
of their equal value line. Thus the value on input "B" must be above 0.625 in
order for the top subcircuit (represented by the red line) to produce an output.
Since the top subcircuit has a zero valued weight on the "A" line, the "A" input can be
any value. In contrast the input values for the bottom subcircuit must be
above the blue line for it to produce an output. Since the greatest valued
output is the one selected, the input with the greatest effective distance
from its equal value line is selected. Consequently, the perceptron has the
same linear limitation as the ADALINE although in this case it is called
linear separability.
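The equal value line and the selection of the greatest output can be sketched as follows. Only the 0.625 border on input "B" comes from the text; the weights and thresholds that produce it, and all values for the bottom subcircuit, are hypothetical because the figure values are not reproduced here.

```python
def subcircuit_output(pattern, weights, threshold):
    """Weighted sum minus a subtractive threshold, floored at zero."""
    weighted_sum = sum(w * x for w, x in zip(weights, pattern))
    return max(0.0, weighted_sum - threshold)


def select_winner(pattern, subcircuits):
    """Return the index of the subcircuit with the greatest output."""
    outputs = [subcircuit_output(pattern, w, t) for w, t in subcircuits]
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best, outputs


subcircuits = [
    ([0.0, 2.0], 1.25),  # "top": zero weight on A, fires only when B > 1.25 / 2.0 = 0.625
    ([2.0, 0.5], 1.0),   # "bottom": hypothetical weights and threshold
]

winner_high_b, _ = select_winner([0.2, 0.8], subcircuits)
winner_high_a, _ = select_winner([0.8, 0.2], subcircuits)
```

Because each subcircuit's border is a straight line in the input space, any two classes it separates must be separable by a line, which is the limitation named in the paragraph above.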
Figure 3
Pattern Separation Space for Fig. 1
During learning the weights are changed according to an
increment rule such that the weights on a misclassified pattern are increased
by an amount proportional to the total error. This has the effect of
moving the equal value line further upward and rightward. The process of finding
the weight values giving the greatest pattern separation is complex and requires
the use of various optimization procedures. These optimization procedures
work best when the changes they command all have the same effect on the process
being optimized. Since the subtractive threshold has a different effect from
the weights, it was generally set to some fixed value and ignored. Full optimization,
as opposed to piecewise optimization, requires that the whole problem be considered
with all possible inputs. Consequently, all the perceptron optimization procedures
in the literature require that all the patterns be known before the optimization
procedure begins, which severely limits their use.
These optimization procedures work by initializing the weights
to some low value, not necessarily random. All the patterns are presented
and for those patterns which are incorrectly classified, the degree of mismatch
between the response of the correct convergent subcircuit and the incorrectly
responding convergent subcircuit is noted. All these mismatches are added
together in some fashion (depending on the type of optimization procedure)
to give the total error which is to be minimized and all the weights on the
incorrectly responding templates are incremented upwards in proportion to the total
error. (For a comprehensive mathematical review of all perceptron improvements
centered around linear discriminant functions, see chapter 5 of Duda and Hart.)
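One procedure of this family is the batch perceptron rule for two classes, sketched below. This is the textbook form of the rule, not a reconstruction of any specific procedure from Rosenblatt's book; the function name, the +1 / -1 class labels, and the sample patterns are assumptions. In line with the remark above, the threshold is held fixed (at zero here) and only the weights are adjusted.

```python
def batch_perceptron(samples, labels, rate=1.0, max_epochs=100):
    """Batch perceptron rule for two classes labelled +1 and -1.

    All patterns must be known in advance: every pass presents the whole
    set and sums the corrections from the misclassified patterns.
    """
    weights = [0.0] * len(samples[0])  # low initial values, not random
    for _ in range(max_epochs):
        update = [0.0] * len(weights)
        misclassified = 0
        for x, label in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, x))
            if label * score <= 0:  # wrong side of (or on) the border
                misclassified += 1
                for i, v in enumerate(x):
                    update[i] += label * v
        if misclassified == 0:
            return weights
        weights = [w + rate * u for w, u in zip(weights, update)]
    raise RuntimeError("no separating weights found within max_epochs")


# Hypothetical normalized patterns (each adds up to one).
samples = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9)]
labels = [1, 1, -1, -1]
weights = batch_perceptron(samples, labels)
```

The loop stops only when a full pass produces no misclassification, which can happen only if the classes are linearly separable.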
Rosenblatt summed up perceptrons in this passage from his 1962 book (page
28):
"Perceptrons are not intended to serve as detailed copies of any actual
nervous system. They're simplified networks, designed to permit the study of
lawful relationships between the organization of a nerve net, the organization of
its environment, and the 'psychological' performances of which it is capable.
Perceptrons might actually correspond to parts of more extended networks and
biological systems; in this case, the results obtained will be directly applicable.
More likely they represent extreme simplifications of the central nervous
system, in which some properties are exaggerated and others suppressed. In
this case, successive perturbation and refinements of the system may yield
a closer approximation."
The 1960s also saw the growth of artificial intelligence
techniques based mostly on net search techniques and higher order logic languages
known as propositional and predicate calculus. The aim of artificial intelligence
(A.I.) researchers was to simulate intelligent processes at a level more abstract
than the direct neural level, and they were having good success at the time.
They could see what was required of any intelligent machine, so when the supporters
of perceptrons began to oversell their potential, one of A.I.'s founders, Marvin
Minsky (who had started out with neural networks), was inspired, together with Seymour
Papert, to write a book in 1969 describing the perceptron's inherent limitations.
Since the perceptron was the most sophisticated neural network idea at the
time, that book essentially ended neural network research in the United States
for a period of time.
Duda, R.O. & Hart, P.E. (1973). Pattern Classification and Scene Analysis, John Wiley & Sons, New York
Minsky, M. & Papert, S. (1969). Perceptrons, An Introduction to Computational Geometry, MIT Press, Cambridge, MA
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington D.C.
The Association Networks of Kohonen and Anderson (1972)
Still enamored of the idea that
brain memory is based upon distributed pathways, researchers in the next stage of neural
network research avoided the pattern classification problems represented by
the perceptron and instead concentrated on how these memories might be formed.
This stage in neural network research began in 1972 with the publication of
two papers on the subject by different authors working independently of each
other. One was by James Anderson, who was inspired by the William James - Donald
Hebb model of memory, for he states in his paper:
"If a group of neurons projects
to another we shall show that strengthening or weakening the synaptic connection
between the two groups according to a simple multiplicative function of activity
in pre and postsynaptic cells automatically generates an interactive memory
..." (page 182)
Figure 4
An Association Network
In a similar way, Teuvo Kohonen from the Helsinki University of Technology
in Finland was inspired by an idea prevalent at the time, which was that memory
may be holographic in nature. The result was a network identical to that proposed
by Anderson. Interestingly, both used matrix mathematics to describe their
ideas and apparently because of this did not realize that what they produced
was an array of analog ADALINE circuits.
Figure 4 shows such a network,
the purpose of which is to associate or correlate an input pattern (vector) with
some desired output pattern (vector). The weights for each convergent subcircuit
represent a row in a matrix such that the calculation done by each convergent
subcircuit is the inner (dot) product between the input vector and
the weight vector. Consequently, figure 4 represents a square 3 row
by 3 column matrix. An association network is different from perceptron-like
networks in that a presented pattern is supposed to activate a set of output lines
instead of just one. Consequently, a single convergence for a decision is not required
in association networks.
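The matrix view of the network can be written out directly: each row of the weight matrix is one convergent subcircuit, and each output is the dot product of that row with the input pattern. The weight values and the pattern below are hypothetical and are not those of figure 4.

```python
def associate(weights, pattern):
    """One dot product per row: output[i] = sum over j of weights[i][j] * pattern[j]."""
    return [sum(w * x for w, x in zip(row, pattern)) for row in weights]


# A 3 row by 3 column weight matrix with made-up values.
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
output = associate(weights, [1.0, 1.0, 0.0])
```

Note that more than one output line is active at once, in contrast to a perceptron, where only the greatest output is selected.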
Producing an output pattern which exactly matches the desired
output pattern
requires that the input patterns don't overlap, thus avoiding the subset problem.
In vector terminology, this is the same as saying that the input pattern vectors need
to be orthogonal. Alternatively, a successful match can also occur if all
input patterns have differing overall magnitudes (distances) that are proportional
to their desired output patterns. Figure 4 shows the failure of pattern association
with overlapping and
normalized (adding up to 1.8 in this case and thus having the
same overall magnitude) patterns using only positive inputs.
Yet such a correlation is possible using a mix of negative inputs (represented
by negative valued weights) and positive inputs, and the Widrow-Hoff procedure
for learning the weight values will work in most cases.
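The role of orthogonality can be checked with a small sketch. It assumes the weights are formed by summing the outer products of each desired output with its input, the correlation matrix form of this kind of memory; the patterns are hypothetical and are not the ones in figure 4.

```python
def outer_product_memory(inputs, outputs):
    """Sum of outer products: weights[i][j] += y[i] * x[j] for every pair."""
    n_out, n_in = len(outputs[0]), len(inputs[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for x, y in zip(inputs, outputs):
        for i in range(n_out):
            for j in range(n_in):
                weights[i][j] += y[i] * x[j]
    return weights


def recall(weights, pattern):
    return [sum(w * x for w, x in zip(row, pattern)) for row in weights]


targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
orthogonal_inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
overlapping_inputs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]  # both add up to one

exact = recall(outer_product_memory(orthogonal_inputs, targets), orthogonal_inputs[0])
blurred = recall(outer_product_memory(overlapping_inputs, targets), overlapping_inputs[0])
```

With the orthogonal inputs the first target is recalled exactly. With the overlapping, normalized inputs the second association leaks into the recall of the first, so the middle output line carries 0.5 instead of 0.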
The Widrow-Hoff procedure
has trouble when high valued patterns are mixed with low valued patterns.
In this case the weight values for the higher valued inputs will tend to oscillate
wildly while the weights associated with the low valued inputs will change
very little, thus preventing any convergence. In order to minimize this problem
the Widrow-Hoff procedure is often modified for use in these association networks
by making the weight changes proportional to the product of the input values
and the desired output values instead of just proportional to the input values.
This tends to average out the value variations at the cost of longer learning
times. Still, to be absolutely sure of convergence where such solutions are
possible, other more complex algorithmic optimization procedures (not likely to be
realizable in a biological neural setting) are needed. Solutions are even less likely
to occur if the input patterns are not normalized, which results in more situations
in which the convergent subcircuits have identical values for most of their
input values, leaving less "room" for the remaining weights to adjust themselves
to achieve the desired output values. For an extensive mathematical review
of this type of neural network with its many variations and weight change
rules, see the 1984 book by Kohonen.
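The plain Widrow-Hoff (delta) rule, in which each weight changes in proportion to the output error times the input value, can be sketched as below. The learning rate, the epoch count, and the patterns are assumptions chosen so that this small example converges; the modified rule described above is not shown.

```python
def widrow_hoff(inputs, targets, rate=1.0, epochs=200):
    """Delta rule: weights[i][j] += rate * (desired[i] - actual[i]) * x[j]."""
    n_out, n_in = len(targets[0]), len(inputs[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            for i in range(n_out):
                actual = sum(w * v for w, v in zip(weights[i], x))
                error = y[i] - actual
                for j in range(n_in):
                    weights[i][j] += rate * error * x[j]
    return weights


def recall(weights, pattern):
    return [sum(w * x for w, x in zip(row, pattern)) for row in weights]


# Hypothetical overlapping, normalized input patterns and desired outputs.
inputs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
targets = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
weights = widrow_hoff(inputs, targets)
```

For these overlapping patterns the rule settles on a solution that needs a negative weight (the second weight of the first row approaches -1), which illustrates why a mix of positive and negative weights makes the association possible.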
Anderson, James A. (1972). A Simple Neural Network Generating an Interactive Memory, Mathematical Biosciences 14:197-220
Kohonen, Teuvo (1972). Correlation Matrix Memories, IEEE Transactions on Computers, C-21:353-359
Kohonen, Teuvo (1984). Self-Organization and Associative Memory, Springer-Verlag, Berlin
The Cognitron - First Multilayered Network (1975)
Figure 5
The Repeatable Unit of the Cognitron
In 1975, inspired
by the self-organization ability of the brain, Kunihiko Fukushima from Japan
introduced the Cognitron network as an extension of the Perceptrons.
Like the Original Perceptron the Cognitron is a pattern regularity detector
meaning it is able to learn patterns without some mechanism (a teacher) to
indicate the success or non-success of a pattern match. Unlike the original
Perceptron the Cognitron is better able to handle (but not perfectly) the
pattern subset problem in which one pattern is completely contained within
the other. It does this by using a special inhibitory input to the convergent subcircuit
node which tends to counteract the effects of larger patterns. Also unlike the original
Perceptron the Cognitron can discriminate to some degree between analog patterns
although binary patterns are usually presented to the first layer.
A basic
unit (section) of the Cognitron having two convergent subcircuits is shown
in figure 5. It has four input lines labeled A through D. Notice lines B
and C are common to both convergent subcircuits. It learns by increasing the
weights on the active convergent subcircuit lines of the subcircuit selected
by the gate comparator as having the greatest output. Like the original perceptron,
the best match is simply strengthened. The rule for adjusting each positive
line weight in a selected convergent subcircuit is:
Facilitatory Weight Increment =
(Proportionality Constant) * (Positive Line Value) / (Number of Subcircuit
Inputs)
In the Cognitron the weights can increase without limit but this
is balanced by increasing the weights on the inhibitory inputs at the same
time. The rule for adjusting the inhibitory weight is:
Inhibitory Weight Increment =
(Pattern Generality Constant) * [(Sum of all positive inputs into the subcircuit
node) / (Total Pattern Value)]
The Pattern Generality Constant in the
original paper was 1/2 and it is needed to help define the dynamic equilibrium of
the network between the positive and negative line values feeding into the subcircuit
node. This dynamic equilibrium in turn defines the degree of pattern discrimination
versus pattern generality.
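The two increment rules can be written out as a short sketch. It follows the verbal formulas above literally; reading "sum of all positive inputs into the subcircuit node" as the weighted line values, and "total pattern value" as the sum of the pattern's values, are assumptions, as are the function names and the example pattern.

```python
def facilitatory_increment(line_value, n_inputs, proportionality=1.0):
    """(Proportionality Constant) * (Positive Line Value) / (Number of Subcircuit Inputs)."""
    return proportionality * line_value / n_inputs


def inhibitory_increment(positive_inputs, pattern, generality=0.5):
    """(Pattern Generality Constant) * (sum of positive inputs) / (total pattern value)."""
    return generality * sum(positive_inputs) / sum(pattern)


# Hypothetical binary pattern on a three-input subcircuit with unit weights.
pattern = [0.0, 1.0, 1.0]
weights = [1.0, 1.0, 1.0]
positive_inputs = [w * x for w, x in zip(weights, pattern)]

fac = [facilitatory_increment(x, len(pattern)) for x in pattern]
inh = inhibitory_increment(positive_inputs, pattern)
```

Only the active lines receive a facilitatory increment, while the inhibitory weight grows on every update, which is what keeps the two in balance.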
Dynamic equilibrium effects are best seen
in the example shown in figure 5, which represents a Cognitron section at a
particular moment in time in which the positive weights have a value of 1 and the
top inhibitory weight has a value of 0.75 while the bottom subcircuit has
an inhibitory weight value of 0.6. These weights allow a subset pattern discrimination
(such as 1,1,1 versus 0,1,1). Yet this discrimination is only possible if
the bottom inhibitory weight has a value between 0.75 and 0.45. Any weight
value outside that range forces the two patterns to be classified as belonging
to the same general class. The Pattern Generality Constant is chosen as
whatever works, for no analytical derivation of its value for a given degree
of generalization has yet been devised. Also one would think that it would be a
prime candidate to be adaptively determined itself, but no method has yet been devised
for that either.
With such a narrow range for subset pattern discrimination
the number of input lines of a convergent subcircuit needs to be rather small
in order to preserve resolution. For example, the percentage difference between patterns
having 20 and 21 binary values is not as great as the difference between patterns
having 3 and 4 binary values. Consequently, Fukushima divided the Cognitron
into repeatable sections and to connect the sections he was forced to use
several layers. This use of multiple layers was to inspire other multilayered yet
quite different networks in the future (such as the hybrid network below).
Figure 6
The NeoCognitron's Position Independence Strategy
In order to combat the ever increasing line values due to ever increasing
weight values Fukushima did not use simple summation and subtraction operations
for the convergent subcircuit node. Instead he combined the positive and subtractive
nodal inputs with a formula which slows the growth of the output value. The
exact equation is (e - h)/(1 + h) where e is the excitatory or additive input
and h is the inhibitory or subtractive input.
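The effect of this formula is easy to check numerically. In the sketch below the function name and the input values are assumptions; the inhibitory input is simply taken to be half of the excitatory input to show what happens as both grow together.

```python
def node_output(e, h):
    """Cognitron node formula from the text: (e - h) / (1 + h)."""
    return (e - h) / (1.0 + h)


small = node_output(1.0, 0.5)     # (1 - 0.5) / (1 + 0.5)
large = node_output(100.0, 50.0)  # (100 - 50) / (1 + 50)
```

Although the inputs grow a hundredfold, the output stays below 1: when h grows in proportion to e, the result approaches e/h - 1 rather than growing without limit.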
The many layers and sections of the Cognitron allowed
it to be modified so that it could respond in the same way (having the same
final output) if the same object moved around in a visual field. This modification
was called the Neocognitron by Fukushima, who published it in 1980. All that
was done was to add a final set of
summation nodes after a layer's gate comparator which summed all the outputs
from all the convergent subcircuits in the same location of each section
(see figure 6). If a feature pattern was moved it would be in the same location in some new
section as it had been in its old previously learned section.
Consequently it would activate the same final summation node as before to
effect position independence, which is limited only by the degree of overlap
between sections (if the sections do not overlap very much then the pattern
would have a low probability of being in its exact relative location in the new
section).
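The added summation stage can be sketched as a sum over sections of the outputs found at the same location in each section. The data layout (one list of subcircuit outputs per section) and the values are assumptions for illustration.

```python
def position_independent_outputs(section_outputs):
    """Sum, across sections, the outputs of subcircuits at the same location."""
    return [sum(same_location) for same_location in zip(*section_outputs)]


# Three sections with two convergent subcircuits each (hypothetical values).
# The same feature activates subcircuit 1, first in section 0, then in section 2.
feature_in_first_section = [[0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
feature_in_third_section = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

before_move = position_independent_outputs(feature_in_first_section)
after_move = position_independent_outputs(feature_in_third_section)
```

Both calls activate the same final summation node, so the final output does not depend on which section the feature falls in.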
Fukushima, K. (1975). Cognitron: A Self-organizing Multilayered Neural Network, Biological Cybernetics 20:121-136
Fukushima, K. (1980). Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Biological Cybernetics 36:193-202
The Hopfield Association Network - Revival of the Reverberatory Networks (1982)
Figure 7
Hopfield Network Examples
In 1982, John Hopfield revived interest in neural networks in the United States with
the introduction of a new type of reverberatory network of the association
type. It differed from
the earlier versions by using bi-directional lines (equivalent to two reciprocal
unidirectional lines) between summation nodes instead of unidirectional lines
and emphasized individual cells (nodes) instead of cell assemblies. The summation
nodes have a threshold of zero and only produce an output value of one if that threshold
is met or exceeded. The weight values between the nodes can be any number
between -1 and 1 so both excitatory and inhibitory operations are represented. The
general learning idea behind the network is that the weights between the active
nodes (those producing an output of 1) will increment while those between
all other nodes will decrease. This is usually accomplished by normalizing
all the weights. In the original paper by Hopfield all the weights added up
to zero.
So in order to learn a correlation (association) between some input
and output pattern a separate training session must be held in which the input
pattern is presented along with the desired output pattern. In the original paper
the weights were assigned initial values at random. The weights between active
nodes are then incremented by some amount and the normalization process then
decrements the remaining weights. The process is repeated for all patterns
until the weights stop changing (or nearly so).
Figure 7 shows some of the
characteristics of the Hopfield association network. Both subset discrimination
and generalization are possible. A use for this generalization ability emphasized
by Hopfield is this network's use as a content addressable memory in which
the full memory (pattern) would be retrievable by providing only partial information.
As can be seen from the example this mostly amounts to creating a pathway
for the common elements in the input patterns. Yet not all associations are
possible. In practice the number of different associations which can be learned
and recalled is only about 15% of the number of summation nodes. This lack
of association formation is shown in the bottom example in figure 7. The
small four node network cannot learn the associations presented in the example.
The network needs to be larger or the normalization value must be larger than
zero. Yet increasing the normalization value will decrease the pattern discrimination
ability of the network.
As learning proceeds in a Hopfield network with initially
random weight values the weights will change in a manner so that fewer paths
will be active. So the fewer paths which are active the closer is the network
to fully learning its pattern set. This general tendency towards minimizing
active pathways can be measured by summing together the output values of all
the weights. Hopfield calls this the energy level of the network which is
minimized as the network learns.
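Content addressable recall and the falling energy level can be illustrated with a small sketch. It uses the one-shot storage rule and the quadratic energy function from Hopfield's 1982 paper rather than the iterative increment-and-normalize training described above, so it is a swapped-in technique; the eight node pattern and the function names are assumptions.

```python
def store(patterns):
    """Storage rule for 0/1 patterns: T[i][j] = sum of (2*V_i - 1) * (2*V_j - 1), T[i][i] = 0."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += (2 * p[i] - 1) * (2 * p[j] - 1)
    return weights


def energy(weights, state):
    """Energy E = -1/2 * sum over i, j of T[i][j] * V_i * V_j."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))


def recall(weights, state, sweeps=5):
    """Update nodes one at a time; a node outputs 1 if its sum meets the zero threshold."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(w * s for w, s in zip(weights[i], state))
            state[i] = 1 if total >= 0 else 0
    return state


memory = [1, 1, 1, 1, 0, 0, 0, 0]
weights = store([memory])
probe = [1, 1, 0, 0, 0, 0, 0, 0]  # partial information about the stored memory
recalled = recall(weights, probe)
```

Starting from half of the stored pattern, the network settles on the full pattern, and the energy of the settled state (-6.0) is lower than that of the probe (-1.0).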
While the Hopfield association network can discriminate
between almost any patterns given sufficient size, the number of connections
increases with the square of the number of nodes, meaning it will get very
complicated very fast. Also, the requirements for normalization and binary
inputs are other limitations.
Hopfield, J.J. (1982). Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proceedings of the National Academy of Sciences 79:2554-2558
Reilly & Cooper's Hybrid Network - The First Hybrid Network (1982)
While the Hopfield network got most of
the press during the neural network revival, another significant type of network,
called at this site a hybrid network, was presented by Doug Reilly, Leon
Cooper and Charles Elbaum. This is a two layered network with each layer accomplishing
(although imperfectly) a different strategy (thus the hybrid name). The first
(input) layer of convergent subcircuits is responsible for generalization
while the second (output) layer of convergent subcircuits is responsible for
specification. This turns out to be an important and powerful pattern
classification strategy.
Figure 8
Two Layer Hybrid Network
Figure 8 shows
the basic workings of a section of Reilly and Cooper's hybrid network having
two pattern classification outputs and three pattern feature inputs. A network
would consist of many such sections. The top output line of this section is activated
for class "A" patterns while the bottom output line is activated for class
"B" patterns.
If a pattern is presented which does not produce an output then the
weights on the active lines of the front layer convergent subcircuits are
incremented to some value greater than one. Since the threshold has a value
of one, this ensures that the pattern will produce an output from the front
layer. Next the weight on a branch of that output line (which is now active)
in the second layer which has been assigned to represent that class of patterns
is now incremented to one resulting in an output.
If a pattern (the blue
"A" class pattern) is then presented which is supposed to belong to the same
class as the red pattern yet is so different that it does not produce an output,
then the above procedure is repeated. Another possibility is that a pattern
belonging to a different class (the blue "B" class pattern) will falsely produce
an output from the "A" class output. In such a case, called confusion, the
front layer of the "A" class convergent subcircuit must be trimmed back by
reducing the weight values on the active lines until an output is no longer produced.
These learning procedures will work as long as not too much overlap occurs between
the presented patterns. The greater the pattern feature overlap, the more front
layer convergent subcircuits must be used and the fewer input lines each must have.
The main limitation of Reilly's hybrid network is that it has all the limitations
found in the Adaline type of convergent subcircuit with its multiplication factor
weights and summation node. Consequently, the input patterns need to be normalized
(adding up to 1 here). Its great strength is that non-linear patterns can
be classified and discriminated.
Reilly, Douglas L.; Cooper, Leon N.; Elbaum, Charles (1982). A Neural Model for Category Learning, Biological Cybernetics 45:35-41
The Multilayered Back-Propagation Association Networks (1986)
Figure 9
A Back Propagation Convergent Subcircuit
With multiple layered neural networks in the news the
question was what was the best way to extend the Widrow-Hoff (Delta) rule
to multiple layers. In 1986 three independent groups of researchers: 1) Y.
Le Cun 2) D. Parker 3) D. Rumelhart, G. Hinton, & R. Williams came up
with essentially the same idea which came to be called the Back Propagation
network for the way it distributes pattern recognition errors throughout the
network. Yet their ideas turned out to be the neural analog of a steepest descent
algorithm discovered by Paul Werbos in 1974. (I have not seen this paper so I am
taking Stephen Grossberg's word on this.)
The basic repeatable unit (convergent
subcircuit) used in the Back-Propagation network as described by Rumelhart,
Hinton, and Williams is shown in figure 9. The weights, represented by w(i,j),
can be any positive or negative value but they start out as small, randomly
chosen numbers (values between -0.3 and 0.3 were used). The squashing function (funct.) is
generally used to keep the output of the convergent circuit 1 or less. The
error is the difference between the desired output and the actual output.
The strategy
used in back-propagation is to determine and use the error contribution made
by each layered pathway through the network. In a three layered network the
learning rule for each layer is (red indicates the additions relative to the
last layer):
Weight value change for last layer = (proportionality constant)(error)(change in
output value from last layer’s function)(pre-weight line value) (weight of last
layer)
Weight value change for middle layer = (proportionality constant)(error)(change
in output value from last layer’s function)(value change from middle layer function)(pre-weight
line value)(weight of last layer) (weight of middle layer)
Weight value change for front layer =
(proportionality constant)(error)(change in output value from last layer’s function)(value
change from middle layer function)(value change from first layer function)
(pre-weight line value)(weight of last layer) (weight of middle layer)(weight
of first layer)
Figure 10
Learning Rate Example
Notice that the rule for changing the weights in the last layer
is a variation of the Widrow-Hoff (Delta) rule first used in their Adaline network,
but it makes use of the more accurate but time consuming post-weight line value
given by (pre-weight line value) (weight value). The other change is the incorporation
of the value change due to the function used to keep the line values between
layers less than or equal to 1. If the network is strictly connected layer
to layer without any convergent subcircuits bypassing a layer the function
component can be left out of the weight change rule for they would no longer
produce any possible difference in the error contribution.
So as one moves forward
in the network towards the inputs the weight change rules include all the
weight values of the more rearward layers in its error path. This produces
smaller weight changes nearer to the front. The assumption here is that the
weights near the back of the layer are more responsible for the error than those
near the front in accordance with The Fundamental Pattern Classification Strategy.
Whereas Hybrid networks use just two layers to accomplish this strategy the
Back-Propagation networks spread it over several. The result is slow learning
often needing as many as thousands of iterations to learn a set of patterns.
An example of how including weight values in the weight change rule affects learning
is given in figure 10. Weights less than 1 simply produce small weight changes.
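The shrinking of weight changes towards the front can be seen in a sketch of a chain of three single-unit layers. It uses the standard chain rule form of back-propagation with a logistic squashing function, which differs in detail from the verbal rules above (for example, the last layer's change does not include that layer's own weight). The network shape, the starting weights of 0.3, and the input and target values are assumptions.

```python
import math


def sigmoid(z):
    """Logistic squashing function; keeps each output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2, w3):
    h1 = sigmoid(w1 * x)   # front layer
    h2 = sigmoid(w2 * h1)  # middle layer
    y = sigmoid(w3 * h2)   # last layer
    return h1, h2, y


def weight_changes(x, target, w1, w2, w3, rate=1.0):
    """Changes that reduce the squared error 0.5 * (target - y) ** 2."""
    h1, h2, y = forward(x, w1, w2, w3)
    delta3 = (target - y) * y * (1.0 - y)    # error times slope of last function
    delta2 = delta3 * w3 * h2 * (1.0 - h2)   # passed back through w3
    delta1 = delta2 * w2 * h1 * (1.0 - h1)   # passed back through w2
    return rate * delta1 * x, rate * delta2 * h1, rate * delta3 * h2


dw1, dw2, dw3 = weight_changes(x=1.0, target=1.0, w1=0.3, w2=0.3, w3=0.3)
```

Each step towards the front multiplies the error signal by a weight smaller than 1 and by a function slope of at most 0.25, so the front layer's change comes out far smaller than the last layer's.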
Parker, D.B. (1986). A Comparison of Algorithms for Neuron-Like Cells. In J.S. Denker (Ed.), Neural Networks for Computing (pp 327-332), New York: American Institute of Physics.
Rumelhart, David D., Hinton, Geoffrey E., and Williams, Ronald J. (1986). Learning Representations by Back-Propagating Errors, Nature 323:533-536
Werbos, Paul (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Unpublished doctoral thesis, Harvard University, Cambridge, MA
The First Regularity Detector with Dynamic Allocation - The Adaptive Resonance (ART) Networks (1987)
Figure 11
The Problem of Detecting a Pattern Mismatch
The first generation of pattern regularity detectors for unsupervised learning were
based upon using convergent subcircuits having random input connections. Those input
connections which happened to match some presented pattern were then strengthened.
Those networks ran into the combinatorial explosion problem as the number
and size of the presented patterns increased, since the average number of random
connections had to increase even more.
The second generation of regularity detectors
was introduced in 1987 by Gail Carpenter and Stephen Grossberg of Boston University. Their network using binary input patterns was called ART 1 (1987a) while their
network using analog inputs was called ART 2 (1987b). They were the first networks
to use the convergent subcircuit dynamic allocation strategy which assigns
a new convergent subcircuit to the presented pattern if no other convergent
subcircuit has a sufficient match.
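The dynamic allocation strategy itself can be sketched independently of the ART circuitry. In the sketch below the match score is a plain cosine between the pattern and each stored template, and the threshold of 0.9 is arbitrary; both are assumptions and neither is the ART 1 or ART 2 matching rule. The patterns are hypothetical.

```python
import math


def cosine_match(a, b):
    """Cosine of the angle between two patterns (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def allocate(patterns, threshold=0.9):
    """Assign each pattern to the best matching template, or allocate a new one."""
    templates, assignments = [], []
    for p in patterns:
        scores = [cosine_match(p, t) for t in templates]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            templates.append(list(p))  # no sufficient match: new subcircuit
            assignments.append(len(templates) - 1)
        else:
            assignments.append(best)
    return templates, assignments


patterns = [[1.0, 0.0], [0.6, 0.4], [0.95, 0.05], [0.55, 0.45]]
templates, assignments = allocate(patterns)
```

The first two patterns are too different to share a template, so each gets its own; the last two are close enough to existing templates to be assigned to them.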
The problem here is determining
when a convergent subcircuit does not match its pattern. The easiest approach
is simply to assume that the greater the output of the subcircuit the better
the match so one can simply use a Gate Comparator subcircuit to pick the greatest
output value. Yet this is not always the case as shown in the top subcircuit of
figure 11 in which its matching pattern (the superset pattern) produces a
lower output value than a mismatched subset pattern. Yet the bottom convergent
subcircuit does produce an output which is larger for its matching pattern
when the matching pattern is the subset pattern. If these two subcircuits
were in the same network and the superset pattern was presented, the bottom
subcircuit would still produce the greater output even though the pattern is
not matched to it! This competitive mismatch problem is inherent in all convergent subcircuits based on
summation nodes even when they use normalized inputs as is done in figure 11.
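The mismatch can be reproduced with two numbers per pattern. The sketch uses a superset pattern of 0.6, 0.4 and a subset pattern of 1, 0, the values this article quotes in its discussion of figure 12; setting each subcircuit's weights equal to its own pattern is an assumption about figure 11.

```python
def summation_output(weights, pattern):
    return sum(w * x for w, x in zip(weights, pattern))


superset_pattern = [0.6, 0.4]  # both patterns add up to one (normalized)
subset_pattern = [1.0, 0.0]

top_weights = superset_pattern     # top subcircuit, tuned to the superset pattern
bottom_weights = subset_pattern    # bottom subcircuit, tuned to the subset pattern

top_on_match = summation_output(top_weights, superset_pattern)        # 0.36 + 0.16
top_on_mismatch = summation_output(top_weights, subset_pattern)       # 0.6
bottom_on_superset = summation_output(bottom_weights, superset_pattern)  # 0.6
bottom_on_match = summation_output(bottom_weights, subset_pattern)    # 1.0
```

The top subcircuit responds more strongly to the mismatched subset pattern than to its own pattern, and when the superset pattern is presented the bottom subcircuit still produces the greater output. Scaling all the weights by a common factor does not change which subcircuit wins.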
The competitive mismatch problem of summation node convergent subcircuits can
be mostly solved by using the positive feedback approach of the ART networks.
The positive feedback is used to enhance the superset signals in a process
called Adaptive Resonance, a term which was first coined by Stephen Grossberg
in 1976 as he investigated how stable reverberation circuits might be formed
(which he called a short term memory) so that weight changes could take place
(long term memory). The limitation of adaptive resonance is that during learning
the superset patterns must always be presented first so that the weights can
be set to their proper value for the resonance to work.
Figure 12
The Core of the ART 2 Network
Figure 12 shows the core
circuitry of the ART 2 network (analog inputs). The F2 layer contains two
convergent subcircuits tuned to the same patterns which were used in figure 11 except here the weight values are 10 times the pattern values. At time = 0 and
in accordance with the competitive mismatch problem, the presented superset
pattern of 0.6 and 0.4 produces the largest output out of the bottom subset
subcircuit (tuned to the pattern 1, 0). This value is selected by a gate comparator
subcircuit (which passes the largest value among its inputs), is then converted
to a constant "d" (here 0.9), and fed back to all the inputs of the F1 layer.
But before the feedback signal can pass through the network again the network
is reset by a large inhibition of the active node in the F2 layer.
This immediately
causes the gate comparator to output the signal of the top convergent subcircuit
which manages to surge through the network before the reset circuit can react
again. So the workings of the ART 2 network require precise signal timing which
is why it is often described as a real time network. The addition of the feedback
signal now forces the F1 nodes to output their maximum value of one which
prevents any reset.
The key to making the ART 2 work is the reset rule. The simplest
and most intuitive reset rule (and one which allows superset patterns to be
presented first) would be to calculate the average of the non-zero inputs
into the F2 layer. Since these lines have a maximum value of 1 a perfect match
would have an average of 1. A vigilance parameter between 0 and 1 (0.9 used
in the example) can then be defined to allow less than perfect matches to
be averaged together by the convergent subcircuit (a vigilance parameter of 1
requires a perfect match by each convergent subcircuit). Yet this rule is an abrupt
and discontinuous rule which makes an all-or-nothing cut-off in not considering
0 valued input lines. What about lines having a 0.1 value? So
what ART 2 really does is use a directional average taken from vector mathematics
in the form U / ||U||, whose components, relative to the baseline coordinates, equal the cosines
of the angles between the vector U and those coordinates (||U|| being the norm or length of U). But with this formula
the smaller line values actually have a greater impact, so the vigilance parameter
is divided by the result so that the small values will not be "weighted" more
than the larger values. In either case both rules would be difficult to accomplish
in real neural systems.
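The simple reset rule, and the discontinuity that makes it unsatisfactory, can be sketched as follows. This is the intuitive averaging rule described above, not the rule ART 2 actually uses; the function name and the input values are assumptions.

```python
def simple_reset(f2_inputs, vigilance=0.9):
    """Return True when the average of the non-zero inputs falls below vigilance."""
    active = [v for v in f2_inputs if v != 0]
    if not active:
        return True  # nothing matched at all
    return sum(active) / len(active) < vigilance


perfect = simple_reset([1.0, 1.0, 0.0])      # average of 1.0 and 1.0
poor = simple_reset([1.0, 0.7, 0.0])         # average 0.85, below 0.9
almost_zero = simple_reset([1.0, 1.0, 0.1])  # average 0.7 once the 0.1 line counts
```

Changing the third line from 0 to 0.1 flips the decision from no reset to reset, which is the all-or-nothing behavior the text objects to.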
Learning must take place when no convergent subcircuit matches
some input pattern. When no match occurs the network cycles through several
resets testing the most likely convergent subcircuits. This resetting will
continue unless some other circuit eventually brings it to a halt allowing
the weights to change on the final selected convergent subcircuit. The extra
circuit in the ART 2 network is a slow positive feedback loop on the front
end which increases in value after a period of time and in so doing shuts down the
resetting. Repeated normalizations of the signals keep the values in bounds and
a three level circuit assures that when the input pattern is removed the positive
feedback signals stop as well.
Carpenter, G.A. & Grossberg, Stephen (1987a). A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine, Comput. Vision Graphics Image Processing 37:54
Carpenter, G.A. & Grossberg, Stephen (1987b). ART 2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns, Applied Optics 26:4919-4930
Grossberg, Stephen (1976). Adaptive Pattern Classification and Universal Recoding: II Feedback, Expectation, Olfaction, Illusions, Biological Cybernetics 23:187-202
The First Multivalued Logic Neural Network (1990)
Given the limitations of summation based neural networks and the need for logic-like
operations for higher level intelligence as shown by the artificial intelligence
field, some sort of analog based logic network seemed worth investigating (the next
logical step :) ). The first multivalued logic neural network
was a simple adaptive switching network meant to represent the reticular formation
of the brain (Olmsted, 1990). It was later incorporated as the second stage in
a very successful hybrid network (not published).
Olmsted, David (1990). The Reticular Formation as a Multi-Valued Logic Neural Network, International Joint Conference on Neural Networks - Vol. 1, pp 619-624, IEEE Neural Networks Council