History and Principles of Neural Networks to 1960
by David D. Olmsted (Copyright - 1998, 1999, 2006. Free to use for personal and
educational purposes)
Last revised August 26, 2006
Introduction
This introduction presents the underlying principles of neural networks
so one can understand the strengths and weaknesses of each type and see how they relate to each other. Consequently, this introduction
is generally non-mathematical with an emphasis on network diagrams. What follows
are the significant neural network designs seeking to mimic how memory and
pattern recognition might work in the brain. Other neural network designs not covered here include those for classical
conditioning, motor control in the cerebellum, and rhythmic motor generation.
While I always recommend reading the original journal articles so that one can read the
footnotes, the following book contains most (but not all) of the neural network
articles referenced in this paper:
Neurocomputing: Foundations of Research (1988)
Edited by James A. Anderson and Edward Rosenfeld. MIT Press, Cambridge, Massachusetts
The First Neural Network Theory
(This section is mostly taken from Wilkes and Wade
(1997). Thanks to Prof. Nicholas Wade at the University of Dundee in Scotland
for acquainting me with Alexander Bain)
The first neural network was presented
by Alexander Bain (1818 - 1903) of the United Kingdom in his 1873 book entitled
"Mind and Body. The Theories of Their Relation". His work was no doubt inspired
by the then new findings of neuroanatomists who, using the newly discovered
neural stains of carmine (by Gerlach in 1858), methylene blue (by Nissl in
1858), and hematoxylin (by Waldeyer in 1863 - stains axon fibers), were able
for the first time to see the true extent of the interlocking fibers of the
brain. The state of knowledge before these stains is given by the 1853 Manual
of Human Histology written by Rudolf Kolliker which states that there were various
forms of nerve cells and
"besides these there are a good many fine pale fibers,
like the processes of cells, only more extended of which nothing more can be said
as to whether they are nerve tubes or are to be referred to as the processes of
cells." (McHenry - 1969)
Bain first described memory as a set of nerve currents
weaker than that produced by the original stimulus.
"If we suppose the sound of a bell striking the ear, and then ceasing, there is a certain continuing
impression of a feebler kind, the idea or memory of the note of the bell;
and it would take some very good reason to deter us from the obvious inference
that the continuing impression is the persisting (although reduced) nerve
currents from the past - the remembrance of the former sound of the bell"
(1873, p 90 as reprinted in Wilkes and Wade - 1997).
The ability to recall a specific memory requires that an association (grouping) first be made with
some other memory, sensation, or motor action via some kind of neural growth.
"for every act of memory, every exercise of bodily aptitude, every habit, recollection,
train of ideas, there is a specific grouping, or coordination, of sensations
and movements, by virtue of specific growths in cell junctions" (1873, p.
91 as reprinted in Wilkes and Wade - 1997)
Bain next applied these general ideas in a more concrete fashion, suggesting an early form of threshold decision making as
shown in figure 1, where each line represents an idea
which sums (co-operates) with other ideas:
Figure 1
Bain's Summation Threshold Network
"If each set of sensory
fibres had one definite connection with motory or outcarrying fibres, we should
have always the same movement answering to the stimulation of the same nerves,
as in the reflex system; the fibre a could do nothing but affect the movement
x. It is necessary to the variety and flexibility of our acquirements, that
the fibre a should at one time take part in stimulating x, and at another
time take part in stimulating y, the circumstances being different. The stroke
of the clock will stimulate us at one time to set in one direction, and at another
time in another direction, according to the ideas that it co-operates with."
Yet Bain realized that if his theory were true every possible association
or grouping would have to be hardwired into the brain:
"...the number of fibres
and cells brought into action, before the grouping can converge upon some
one set of cells definitely connected with an out-going motor arrangement,
or with some other internal grouping, - must be very great indeed; and but
for the vast number of fibres and cells, demonstrably present in the brain,
the separate embodiment of every separate impression and idea would seem impractical."
(1873, p. 113, as reprinted in Wilkes and Wade - 1997)
In an attempt to reduce the degree of hardwiring needed, Bain suggested that a signal attenuation
strategy through the brain network would add greater flexibility. In
figure 2 each node has some resistance, so a signal weakens with path length and a
stronger signal travels farther than a weaker one:
Figure 2
Bain's Signal Attenuation Proposal
"... a more energetic current necessarily takes a more extended
sweep, and affects a number of cells and fibres that are left quiescent under
a feebler current. The cells being viewed as crossings - where a current in one
circuit induces a current in an adjoining circuit - there is, at each crossing,
a certain resistance to overcome, and the feebler current is sooner exhausted
and stops short of the distance reached by the stronger." (1873, pp. 113-114,
as reprinted in Wilkes and Wade - 1997)
" ... the degree of stimulation of
the same fibres will determine not merely a greater energy of the same response,
as would happen in reflex stimulation, but a totally different response: a,
weak, determines movement x; a, strong, determines y." (1873, p.109, as reprinted in Wilkes and Wade - 1997).
Notice that Bain does not explain how the stronger signal avoids simultaneously activating the weak
signal action. Bain's adaptive rule for associations (groupings) precedes that of
later, better known authors such as Hebb:
"We know what are the conditions of making an acquirement, or of fixing two or more things together in memory.
The separate impression must be made together, or flow in close succession;
and they must be held together for a certain length of time, either
on one occasion or on repeated occasions. Now to each impression, each sensation
or thought, there corresponds physically a group or series of nerve currents;
when two impressions concur, or closely succeed one another, the nerve currents
find some bridge or place of continuity, better or worse, according to the
abundance of nerve matter available for the transition. In the cells or corpuscles
where the currents meet and join, there is, in consequence of the meeting,
a strengthened connexion or diminished obstruction - a preference track for
that line over lines where no continuity has been established." (1873, p. 117,
as reprinted in Wilkes and Wade - 1997)
Most of the ideas for Bain's book Mind and Body were prepared 10 years earlier for three lectures
to the Philosophical Society in Aberdeen, Scotland. Yet the idea that most
of the brain had to be hardwired apparently did not appeal to most scientists.
Even his student David Ferrier, who published his own classic work The Functions
of the Brain (1876), never mentioned the details of Bain's work although the
work itself is mentioned. In the greatest irony of all, even Bain came to doubt his
own ideas:
"The hypothesis was a legitimate one; but subsequent reflection
led to the belief that the number of psychical elements, although run up to
hundreds of thousands, was still inadequate." (1904, p 313).
Not until 1943, with McCulloch and Pitts at the dawn of the computer age, were brain theorists of equivalent
caliber to be found who went beyond language-based general descriptions
to actual network drawings.
Bain, A. (1873). Mind and body. The theories of their relation. London: Henry King.
Bain, A. (1904). Autobiography. London: Longmans, Green
Ferrier, D. (1876). The Functions of the Brain. London: Smith, Elder and Co.
McHenry, Lawrence, C. (1969). Garrison's History of Neurology. Springfield, IL:Charles C. Thomas
Wilkes, Alan L. and Wade, Nicholas, J. (1997). Bain on Neural Networks. Brain and Cognition 33:295-305
Restatement of Bain's Adaptive Rule by William James (1890)
In 1890 American psychologist William James
restated Bain's adaptive rule, apparently without knowing of Bain. James proposed that all thoughts and bodily
actions were produced as a result of neural currents
flowing from regions (brain-processes) having an excess of electrical charge
to regions having a deficit of electrical charge. The intensity of the thoughts
and actions was proportional to the current flow rate, which in turn was proportional
to the difference in charge between the two regions. Later these reverberating
electrical currents would be called engrams. Learning thus consisted of changing
these current paths or forming new paths by using the following rule:
"When two elementary brain-processes have been active together or in immediate succession,
one of them, on reoccurring, tends to propagate its excitement into the other."
(James 1890, p566)
Such learning improves with repetition:
"... nerve currents propagate
themselves easiest through those tracts of conduction which have been already
most in use." (James 1890, p563)
This also implies that paths not used will tend
to close producing forgetting.
Actual characterization of these nerve currents as a burst of action
potentials traveling in one direction down a single-celled neuron was only established
between 1890 and 1910. This quickly led to the standard neuron model
in which a neuron sums all its facilitatory and inhibitory inputs
from other connecting neurons and,
if the neural charge exceeds a threshold, produces an action potential.
The greater the sensory stimulus intensity, the greater the
frequency of the action potentials, with the maximum frequency typically
being about 100 pulses per second.
According to the theory of William James, a neuron's
transmission efficiency between stimulus and its output should increase with
use as the skill is learned, but the first experimental test found just the
opposite. In 1898 England's Charles Sherrington found that spinal neurons
in cats reduce their efficiency with low intensity use instead of increasing
their efficiency as expected. He had found a phenomenon known today as habituation.
James, William (1890), Principles of Psychology, Henry Holt, New York
Sherrington, C.S. (1898) Experiments in Examination of the Peripheral Distribution of the
Fibers of the Posterior Roots of some Spinal Nerves, Proc. of the Royal Society
of London, 190:45-186
First Neural Logic (1938 to 1943)
Figure 3
Rashevsky's EXCLUSIVE OR
Yet no theory arose to challenge the holistic, reverberatory scheme of William James until 1938
when N. Rashevsky proposed that the brain could be organized around binary
logic operations since action potentials could be viewed as binary 1 (true)
values. He even presented the circuit shown in figure 3 showing how a binary
logic EXCLUSIVE OR operation could be implemented using addition and subtraction
operations. He did not formulate the other logical operations in analog terms so his formulation
remained incomplete.
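As a rough illustration of how an EXCLUSIVE OR can be built from addition and subtraction alone, the sketch below subtracts each input from the other, lets only positive differences through, and adds the two results. This arrangement is an assumption chosen for illustration and is not copied from Rashevsky's figure; the function names are hypothetical.

```python
def positive_part(value):
    """Pass a positive difference unchanged; a line cannot carry a negative pulse."""
    return value if value > 0 else 0


def exclusive_or(a, b):
    """Assumed arrangement: two subtraction nodes feeding one summation node."""
    a_without_b = positive_part(a - b)  # active only when a fires alone
    b_without_a = positive_part(b - a)  # active only when b fires alone
    return a_without_b + b_without_a    # final summation node


for a in (0, 1):
    for b in (0, 1):
        print(a, b, exclusive_or(a, b))
```

Only the two mixed input cases produce an output; when both lines fire, each subtraction cancels the other.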
In 1943 Warren McCulloch and Walter Pitts realized that
the natural consequence of the standard neuron model's threshold in combination
with binary action potentials produced another type of logic called threshold
logic. Since each action potential pulse is an all-or-nothing binary event,
a threshold value of 2 on a two-input node defines an AND operation as shown in figure 4. Likewise,
a threshold value of 1 defines an INCLUSIVE OR operation since a single action
potential on either line is sufficient to produce an output. They also described
the use of the subtraction operation in their paper but never explicitly associated
it with the logical CONDITIONAL operation.
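The threshold logic described above can be sketched in a few lines. This is a minimal illustration rather than McCulloch and Pitts's own notation: it assumes a node fires when the sum of its binary inputs reaches the threshold, and the function names are hypothetical.

```python
def threshold_unit(inputs, threshold):
    """Return 1 when the sum of the binary inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0


def logic_and(a, b):
    """Two inputs with a threshold of 2: both lines must fire."""
    return threshold_unit([a, b], threshold=2)


def logic_inclusive_or(a, b):
    """Two inputs with a threshold of 1: one firing line is enough."""
    return threshold_unit([a, b], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logic_and(a, b), logic_inclusive_or(a, b))
```

The same summation node serves both operations; only the threshold value changes.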
Figure 4
Threshold Logic AND
Threshold logic was expanded by
adding more input lines so that the output of the summation node would become analog
since this seemed to fit in more with the standard neuron model. Yet by using more
than two inputs the operations ceased to be logical, that is, they ceased
to define the basic set of state symmetry testers used as building blocks in
larger circuits.
Instead these operations became parameter classifiers which classified their
input patterns based upon the one parameter of signal magnitude given by the
summation of the binary inputs.
The very existence of an analog value for the first
time allowed these operations to become adaptive using multiplication factors called
weights. Exactly how that adaptability was implemented defines the many different
neural networks described below. Yet modifications to a single parameter can never
fully characterize the
state (the pattern or distribution) of a system's input.
Some ambiguity is always the result yet at the time this ambiguity was hailed
as a great achievement of neural networks for many saw it as meaning that
neural networks could work with partial knowledge even though this ambiguity
could never be precisely defined and controlled.
So the achievement of adaptability in the neural networks of this time came at
the cost of losing the decision making resolution of logic. By 1963 this
split brain researchers into the separate groupings of neural network
researchers and artificial intelligence researchers. The 1963 paper by R.O. Winder
seems to have been the last major paper which still attempted to combine the weight
based and mathematical neural network approach with the logical artificial
intelligence approach but its approach was to use mathematical techniques
to find the desired weights of a network instead of having the network learn them. The result
was that the neural network researchers continued to develop parameter based
classifier circuits while the artificial intelligence researchers expanded
and abstracted the binary logic approach to form propositional and predicate
logics which eventually resulted in the development of the programming language
called LISP. Because of the precision of logic the field of
artificial intelligence enjoyed great initial success. Yet its high level algorithmic
and sequential approach eventually proved to have serious limitations in terms of adaptability
and the handling of uncertainty.
McCulloch, W.S. & Pitts, W.H. (1943). A Logical Calculus of
the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115-137
Rashevsky, N (1938). Mathematical Biophysics, University of Chicago Press, Chicago, IL
Winder, R.O. (1964). Threshold Logic in Artificial Intelligence, Artificial
Intelligence, IEEE Publication S-142, pages 107-128.
(Rashevsky, McCulloch, and
Pitts also played a role in the development of systems science. See chapter
2 of The Macroscope by Joel de Rosnay.)
The First Randomly Connected Reverberatory Networks (1954)
Neural network research really only became possible at the dawn
of the computer age when ideas could be validated by simulation on various types
of electronic calculators. The impetus for these simulations was provided
by Donald Hebb of McGill University in Canada who in 1949 proposed this unidirectional
variation of the Bain / James learning rule:
"Let us assume then that the
persistence or repetition of a reverberatory activity (or trace) tends to
induce lasting cellular changes that add to its stability. The assumption
can be precisely stated as follows: When an axon of cell A is near enough
to excite cell B and repeatedly or persistently takes part in firing it, some
growth process or metabolic change takes place on one or both cells so that A's
efficiency as one of the cells firing B is increased" (Hebb 1949, p62)
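One simple way to read Hebb's postulate as an update rule is sketched below. Hebb gave no formula, so the fixed increment, the binary firing flags, and the function name are all assumptions made for illustration.

```python
def hebb_update(weight, a_fired, b_fired, rate=0.1):
    """Raise the A-to-B weight only when A took part in firing B."""
    if a_fired and b_fired:
        return weight + rate
    return weight


weight = 0.2
history = [(1, 1), (1, 0), (0, 1), (1, 1)]  # (A fired, B fired) at each step
for a_fired, b_fired in history:
    weight = hebb_update(weight, a_fired, b_fired)
print(round(weight, 2))
```

The rule is unidirectional in that only the connection from A to B changes, and in this form nothing ever lowers a weight.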
Figure 5
The Farley and Clark Network
The first such Hebbian inspired network was simulated by Farley and Clark in 1954
on an early digital computer at M.I.T. Their network shown in figure 5 consisted
of nodes representing neurons randomly connected with each other by unidirectional
lines having multiplication factors (weights). Each node summed
its inputs and produced an output of one if the sum of its inputs exceeded its threshold
value. In order to recognize patterns the network was divided into quarters with
the inputs from one pattern entering the front top quarter while the inputs
from the other pattern entered the front bottom quarter. Only one pattern
was presented at a time so each tended to produce its own unique flow path
through the network. A pattern was considered to be recognized if it consistently
produced more activity in one of the back network quarters than in the other.
Yet in order to get this network to work Farley and Clark had to modify Hebb's
learning rule. This new rule required that the activity level of each network
line be examined at each instant in time so that if the line value changed in the
desired direction
then all of its nodal input weights would be incremented (increased).
If the change was in the wrong direction then the nodal input weights would
be decremented (decreased). With this new directional rule the network was able to successfully
discriminate between two widely differing patterns as long as they were presented
alternately (see Farley, 1960). The major problem with this type of network
is its lack of pattern discrimination resolution combined with the high number
of neurons needed to make that discrimination. Also the cells in the network
were not self-assembled as was thought to be necessary at the time but were
assigned to assemblies (the quadrants) by the experimenters.
Hebb, D.O. (1949). The Organization of Behavior, John Wiley & Sons, New York
Farley, B., and Clark, W.A. (1954). "Simulation of self-Organizing Systems by Digital Computer",
IRE Transactions on Information Theory 4:76-84
Farley, B.G. (1960) Self-Organizing Models for Learned Perception; in M.C. Yovits and S. Cameron (editors), Self-Organizing
Systems, Pergamon Press, Oxford
The First Reverberatory Network Showing Self-Assembly (1956)
Figure 6
One Cell of the Self-Assembly Network
The next step was taken in 1956 by N. Rochester and colleagues using an
early IBM Type 701 (2 Kbytes memory) and 704 Electronic Calculator (as computers
were often called then) located at the IBM research labs in Poughkeepsie,
New York. Instead of connecting the cells randomly overall as Farley
and Clark did, they organized their cells into a single layer and then randomly
connected the cells' outputs back to other cells' inputs (figure 6). In accordance
with Hebb's learning rule the weights were increased with use and they had
this to say about it:
"No process of just this sort has been observed in living
tissue. However, it has not been possible to demonstrate, by measurement, that the
Hebb postulate is false. Nothing else has been observed that could account
for learning and memory in a plausible way."
Despite this they were forced
to modify Hebb's rule by normalizing all the weight values so that they always
added up to some constant value. The reason was that with use all the weights
would eventually increase to their maximum value. What was desired was that
the more heavily used weights be larger than the lesser used weights, so if
any weight is increased the other weights must be decreased by an equivalent overall
amount. (Significantly, they did not discuss how this process could be accomplished
neurally.)
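The normalization just described can be sketched as follows. The even split of the decrease across the unused weights, the increment size, and the function name are assumptions for illustration; the 1956 paper's exact arithmetic is not given here.

```python
def reinforce_and_normalize(weights, used, increment=0.1):
    """Raise the used weights, then lower the unused ones by the same total."""
    unused = [i for i in range(len(weights)) if i not in used]
    if not used or not unused:
        return list(weights)  # nothing to trade the increase off against
    total_added = increment * len(used)
    share = total_added / len(unused)  # assumed: unused weights share the loss evenly
    return [w + increment if i in used else w - share
            for i, w in enumerate(weights)]


weights = [0.25, 0.25, 0.25, 0.25]
for _ in range(2):
    weights = reinforce_and_normalize(weights, used={0})
print(weights, sum(weights))
```

The total stays constant while the used weight pulls ahead of the others, which is the effect that plain reinforcement alone could not produce.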
They also tried neural habituation in their network, which occurred when a rapidly firing neuron increased its threshold
so as to become less responsive. The result was a network which did not form
any assemblies but which did show a continuing wave-like reverberation after
the inputs ceased.
Meanwhile, Peter Milner, an associate of Donald Hebb, suggested
that cell assemblies could form if most of the synapses within a cell assembly were
excitatory (positive) while those between cell assemblies were inhibitory (subtractive).
So they changed the range of the network weights from 0 through 256 to -1 through 1. In order
to speed up the simulation they ceased using binary pulses on the network
lines and instead had the lines represent analog firing frequencies which in their
model ranged from 0 to 15. They called this the F.M. model. The result was
that cell assemblies built up around the input regions but only around those
regions. They did not appear anywhere else in the network. In response to this
partial success they said the following:
"This kind of investigation cannot prove
how the brain works. It can, however, show that some models are unworkable
and provide clues as to how to revise the models to make them work."
Rochester, N., Holland, J.H., Haibt L.H. and Duda, W.L. (1956). Tests on a Cell Assembly
Theory of the Action of the Brain Using a Large Digital Computer,
IRE Transactions on Information Theory IT-2:80-93
The First Pattern Regularity Detector - The Original Perceptron (1958)
Random neural networks were not having much success yet
many researchers were stuck on the idea that the neural connections in the
brain were mostly random. So a slight compromise was made in 1958 by Frank
Rosenblatt who initiated a new phase in neural network research by abandoning
the idea of self-forming cell assemblies. Instead memory was simply a change
in the relation between some input and some output which changed with regular
use. This meant that those patterns which regularly and consistently occurred
would be learned. This type of network for unsupervised learning is called
an auto-learning network or a pattern regularity detector. Consequently, the
randomness of neural connections was not overall but only occurred between differing
layers of neural cells. On page 388 of his 1958 paper he states the number one assumption
of the perceptron's design as:
"The physical connections of the nervous system which
are involved in learning and recognition are not identical from one organism
to another. At birth, the construction of the most important networks is largely
random, subject to a minimum number of genetic constraints."
The Original Perceptron was presented to answer the last two of these three fundamental
questions about the brain (page 386):
- How is information about the physical world
sensed, or detected, by the biological system?
- In what form is information
stored, or remembered?
- How does information contained in storage, or in memory,
influence recognition and behavior?
The simplicity and random connections of the
Original Perceptron and the later Perceptrons made them a fascinating subject
for mathematical analysis using probability theory. Rosenblatt's preference
for this approach as opposed to the developing algorithmic approach of artificial
intelligence is shown by this passage in his 1960 paper (page 301):
"Simulation
should not, in general, be attempted without a theoretical analysis of the
nerve net in question, sufficient to indicate questions of theoretical interest.
The examination of arbitrary networks in the hope that they will yield something
interesting, or the simulation of networks which have been specially designed
to compute a particular function by a definite algorithmic procedure seem to be
about equally lacking in value."
Figure 7
The Original Perceptron - Positive Inputs
The original Perceptron circuit shown in figure
7 consisted of many convergent type subcircuits feeding into a decision making element which passed only its greatest
valued input (a gate comparator).
A convergent subcircuit is characterized by several input lines feeding into
some central operation or series of operations. In this case the central operation
is a summation node connected to a threshold unit which in turn connects to
a multiplication factor (weight). Rosenblatt called these subcircuits "A"
units for association units. Each received a certain number of binary (0 or 1) positive
inputs and a lesser number of negative (subtractive) inputs connected in a random
manner. If the sum of these inputs exceeded the threshold value then the threshold
would produce an output value of 1. This value was then modified by the weight
valued between 0 and 1 which Rosenblatt called the value of the "A" unit.
The output of each convergent subcircuit was next sent to the gate comparator, which selected its largest input value,
passed it on, and fed it back to the weight of the winning convergent
subcircuit ("A" unit) to increment it by a
certain amount. In this way the most used template would increase its transmission
efficiency in conformity with the ideas of Donald Hebb mentioned above. Any neural
network using a gate comparator is known as a competitive network because
the different convergent subcircuits compete to get their signal selected.
The irony here is that a gate comparator seems to be impossible to form beyond
the two input, two output level using only the mathematical operations of
addition and subtraction (in such a case the circuit is equivalent to the
EXCLUSIVE OR circuit
of figure 3 without the final summation node). In
all the competitive types of neural networks this function is either left vague or
is simply stated as a mathematical function. In Rosenblatt's paper the gate comparator function is discussed only at the two input, two output level.
This should have been a clue that something more than addition and subtraction operations was required in neural networks
(gate comparators of any size are easy to construct
using multivalued logic operations).
Because the inputs to the convergent subcircuits
are randomly connected every subcircuit is already somewhat pre-tuned to respond
to some pattern. The only purpose of the multiplication factors (weights)
is to ensure the correct convergent subcircuit selection by the gate comparator
in situations where the positive inputs from two or more patterns overlap
as shown in figure 7. In such a case both patterns will exceed the threshold
value leaving weight value to make the final discrimination.
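The flow through the Original Perceptron can be sketched as below. This is an illustrative reading of the description above, not Rosenblatt's formulation: the unit layout, the increment size, the cap at 1, and the assumption that a unit fires when its sum reaches the threshold are all hypothetical choices.

```python
def a_unit(pattern, positive, negative, threshold, weight):
    """Sum the binary inputs, apply the threshold, then scale by the unit's value."""
    total = sum(pattern[i] for i in positive) - sum(pattern[i] for i in negative)
    fired = 1 if total >= threshold else 0
    return fired * weight


def gate_comparator(outputs):
    """Return the index of the largest output, or None if no unit fired."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return best if outputs[best] > 0 else None


def present(pattern, units, increment=0.1):
    """Show one pattern, select the winning unit, and reinforce its weight."""
    outputs = [a_unit(pattern, **unit) for unit in units]
    winner = gate_comparator(outputs)
    if winner is not None:
        units[winner]["weight"] = min(1.0, units[winner]["weight"] + increment)
    return winner


units = [
    {"positive": (0, 1, 2), "negative": (), "threshold": 2, "weight": 0.5},
    {"positive": (1, 2, 3), "negative": (), "threshold": 2, "weight": 0.5},
]
present([1, 1, 0, 0], units)             # only unit 0 fires, so its weight grows
print(present([0, 1, 1, 0], units))      # both units fire; the weights decide
```

In the second presentation both thresholds are met by the overlapping inputs, so the earlier reinforcement of the first unit is what settles the selection.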
Figure 8
The Original Perceptron - Positive and Negative Inputs for Pattern Subset Discrimination
The major test
of any neural network's ability to classify patterns is the subset test in
which the neural network must discriminate a pattern which is smaller and
completely contained within another larger pattern. A variation of this test
is the so called "exclusive or" test in which two different smaller patterns
are contained within the larger pattern and all three must be discriminated. The
Original Perceptron is not always able to pass the subset test if it only uses positive
inputs (figure 7) yet it can if negative inputs are used (figure 8). The ability
of the all positive input Original Perceptrons to discriminate between a full
pattern and its subset (partial) pattern depends on the order of pattern presentation
and on luck. If the partial match pattern is presented first then both the
top and bottom convergent
subcircuits of figure 7 would respond equally given
equal initial weights. Yet in true
marketing fashion, this lack of discrimination ability was touted as the significant
brain-like characteristic of generalization in which the pattern does not
have to be an exact match in order for it to be classified! Still, the lack
of control over this process is a weakness of the original positive input only perceptron.
This subset pattern discrimination weakness in the all positive input Original
Perceptron can be remedied to a certain degree if the total multiplication
factor values (weights) of the network remain constant (if they add up to one this
requirement is called normalization). Thus whenever a weight is incremented a certain
amount all the other weights together must be decremented (reduced) the same overall amount.
This biases the response towards the more recently presented and larger patterns.
Thus in figure 7 if the full pattern for the top subcircuit has been presented
most recently then the presentation of the partial pattern will also activate
the top subcircuit. Instead, if the full pattern for the middle subcircuit
has been presented most recently then the presentation of the partial pattern
will activate the middle subcircuit. While this is an interesting phenomenon, weight
normalization does not seem to be something that real neuronal circuits could accomplish.
The use of negative inputs as shown in figure 8 partly overcomes the subset pattern
discrimination problem by having larger patterns inhibit the smaller subset patterns.
A template
is a spatial filter, passing only that pattern (or image) which exactly matches its
positive inputs. If any part of that pattern exceeds (overlaps) a template's
border
then negative inputs will inhibit the activation of the convergent subcircuit.
A partial template leaves part of the possible pattern border undefined (no inhibition).
Another
way to characterize templates with border inhibition is by hardness and softness. A hard template
will completely inhibit the convergent subcircuit if any part of the pattern overlaps
the negative inputs. A soft template requires a large degree of overlap onto
some inhibitory region in order to suppress the convergent subcircuit. The network
template matching function has an algorithmic equivalent called convolution.
In the Original Perceptron the negative inputs were just put in at random
in the hope that they might accomplish the partial template function for some
presented set of patterns.
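A small sketch may help show how border inhibition separates a subset pattern from the larger pattern containing it. The input positions are placed by hand here rather than at random, and hardness is modeled simply by how much slack the threshold leaves; both are assumptions for illustration.

```python
def template_unit(pattern, positive, negative, threshold):
    """Fire when positive inputs minus negative inputs reach the threshold."""
    excite = sum(pattern[i] for i in positive)
    inhibit = sum(pattern[i] for i in negative)
    return 1 if excite - inhibit >= threshold else 0


small = [1, 1, 0, 0]    # subset pattern
spill = [1, 1, 1, 0]    # subset pattern overlapping one border input
large = [1, 1, 1, 1]    # full pattern containing the subset

# All-positive unit for the small pattern: it cannot reject the large one.
print(template_unit(small, (0, 1), (), 2), template_unit(large, (0, 1), (), 2))

# Hard template: a single overlapping border input silences the unit.
hard = dict(positive=(0, 1), negative=(2, 3), threshold=2)
print(template_unit(small, **hard), template_unit(spill, **hard))

# Soft template: one overlap is tolerated, a large overlap is not.
soft = dict(positive=(0, 1), negative=(2, 3), threshold=1)
print(template_unit(spill, **soft), template_unit(large, **soft))
```

With random placement, as in the Original Perceptron, whether the negative inputs happen to fall on a useful border is left to chance.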
In the Original Perceptron overlapping patterns could
also have been discriminated by adaptively changing the values of the thresholds
but this does not seem to have been investigated. For example, a threshold
value of 3 can easily discriminate between a pattern having 3 positive binary
inputs and one having 2 positive binary inputs.
Rosenblatt, F. (1958). The Perceptron: A Probabilistic
Model for Information Storage and Organization in the Brain, Psychological
Review, 65:386-408
Rosenblatt, F (1960). Perceptron Simulation Experiments, Proceedings
of the IRE, 48:301-309
Neural Pattern Classification Arrives - ADALINE (1960)
ADALINE is an acronym for ADAptive LINear element. For the first time a convergent
type subcircuit having weights before the summation node is used to formally
classify patterns. Adaline was developed by Bernard Widrow and Marcian Hoff for
the purpose of recognizing binary patterns so that it could predict the next bit
in a flowing stream of bits. As such it was based on the ideas of R.L. Mattson,
then a student at MIT. Consequently, the paper Widrow and Hoff published is
an engineering paper concerned with computing which makes no mention of the
brain or brain memory. Perhaps because of this Adalines do not use a
threshold.
Figure 9
Adaline - No Convergence of Weights to a Stable Value
Adalines are based on the use of an attractive (as opposed to repulsive)
goal seeking learning procedure in which a convergent subcircuit output value
is defined as the goal for each pattern. Consequently, the subcircuit must
learn the proper weight values to produce that goal value for any given set of input
patterns. This is in contrast to the later perceptrons which were repulsive
driven meaning that a specified goal value is not defined for each subcircuit, only
misclassification errors. Even
less complete error information leads to reinforcement
learning which only provides the information that an error or success has
occurred without the magnitude or direction (type) of that error. Positive
reinforcement is the attractive form while negative reinforcement is the repulsive
form.
Figure 9 shows an ADALINE using a learning procedure which does not
work, while figure 10 shows a procedure which does work. The patterns chosen
for this example again demonstrate the subset problem, in which one pattern is completely
contained within another pattern. This is the most difficult type of discrimination
for convergent type subcircuits to perform.
Learning begins in figure 9 by first choosing a goal value which the pattern
presented to the ADALINE is supposed to match. In Widrow and Hoff's paper these
were 1, 0, or -1. Pattern One is supposed to produce a 1. It is presented at
time = 1 and the error between the goal value and the actual ADALINE value is
noted. The error value is then divided among the weights, which eliminates that
error. Next, Pattern Two, which is supposed to produce a 0, is presented and the
weights are adjusted again to eliminate the error. Pattern One is then presented
again at time = 3 and it produces an error of 0.2, which changes the weights back
to what they were originally, with the result that the network is not learning
the patterns. Yet this method will work sometimes, and it is the procedure used
in Widrow and Hoff's first paper.
A more successful procedure was found by Widrow in 1962. It
is called the Widrow-Hoff learning rule or the Delta learning rule. It is
based on the realization that the greatest sources of error are the active
lines. Consequently, the Widrow-Hoff learning rule changes the value of each weight
in proportion to its pre-weight line value (in this case 1 or 0) according
to the following rule:
Weight Change = (Pre-Weight Line Value) * (Error / (Number of Inputs)).
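The rule above can be sketched in a few lines of Python. This is only an illustration: the two patterns, their goal values, the zero starting weights, and the function names are assumptions chosen to form a small subset problem, not the values used in the figures.

```python
# Illustrative sketch only: patterns, goals, and names are assumptions.

def adaline_output(weights, inputs):
    """Weighted sum of the input lines (no threshold)."""
    return sum(w * x for w, x in zip(weights, inputs))


def widrow_hoff_step(weights, inputs, goal):
    """Weight Change = (Pre-Weight Line Value) * (Error / Number of Inputs)."""
    error = goal - adaline_output(weights, inputs)
    share = error / len(inputs)
    # Inactive lines (value 0) receive no change.
    return [w + x * share for w, x in zip(weights, inputs)]


# Hypothetical subset problem: Pattern Two (1, 0) is contained in Pattern One (1, 1).
training_set = [((1, 1), 1.0), ((1, 0), 0.0)]

weights = [0.0, 0.0]
for _ in range(100):  # repeated presentations of both patterns
    for inputs, goal in training_set:
        weights = widrow_hoff_step(weights, inputs, goal)

print([round(w, 3) for w in weights])
```

With these assumed patterns, the weight on the line shared by both patterns shrinks toward zero while the weight on the line unique to Pattern One grows toward the goal value.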
Figure 10
Adaline - With Convergence
While this is the official Delta Rule, notice that it
does not conserve the error value in accordance with Widrow and Hoff's first paper. With the error value per
weight divided by the number of inputs, not all of it may be used (although repeated presentations
will gradually reduce this error). Yet if the error is conserved, as is done
in figure 10, so that all of it is distributed to the weights, then the error
is totally eliminated (see time = 2 of figure 10). Using error
conservation increases the rate of learning.
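The effect of error conservation on learning speed can be sketched as follows. The sketch assumes one reading of "conserved": the error is divided by the number of active lines rather than by the total number of inputs, so the whole error is handed to the weights that can use it. The patterns, tolerance, and names are illustrative assumptions, not values from the figures.

```python
# Illustrative sketch only: "conserve" follows one assumed reading of the text.

def train(patterns, conserve, tolerance=1e-3, max_epochs=1000):
    """Return (weights, epochs) once every pattern is within tolerance of its goal."""
    weights = [0.0] * len(patterns[0][0])
    for epoch in range(1, max_epochs + 1):
        for inputs, goal in patterns:
            error = goal - sum(w * x for w, x in zip(weights, inputs))
            # Conserved: split the error over the active lines only.
            # Not conserved: split it over every input line, active or not.
            divisor = sum(1 for x in inputs if x) if conserve else len(inputs)
            if divisor:
                weights = [w + x * error / divisor for w, x in zip(weights, inputs)]
        worst = max(abs(goal - sum(w * x for w, x in zip(weights, inputs)))
                    for inputs, goal in patterns)
        if worst < tolerance:
            return weights, epoch
    return weights, max_epochs


# Hypothetical subset problem: (1, 0) is contained in (1, 1).
patterns = [((1, 1), 1.0), ((1, 0), 0.0)]
plain_weights, plain_epochs = train(patterns, conserve=False)
conserved_weights, conserved_epochs = train(patterns, conserve=True)
print("divide by number of inputs:", plain_epochs, "epochs")
print("divide by active lines:    ", conserved_epochs, "epochs")
```

Under these assumptions both versions settle on the same weights, but the conserving version needs fewer presentations because each update removes the whole error for the pattern just shown.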
Instead of just using the pre-weight
line value to indicate the greatest source of error, a variation would be to
include the weight values as well, since the post-weight line value gives a
more accurate measure of error assignment. The trade-off is a slightly more
complex learning rule and slower learning if the error value is not conserved,
since smaller numbers are the result. In fact, this variation is the rule
used in the Back Propagation networks discussed below. (This is part of a larger
problem in neural network research known as the credit assignment problem. As networks
get more complex, determining the source of any resultant error also becomes
more complex.)
What happens with the Widrow-Hoff procedure is that the weights on all those
input lines representing features which are common to the patterns will tend to
zero. This leaves the non-overlapping pattern features to define the convergent
subcircuit output value. Consequently, only one ADALINE is needed for any
set of patterns, making it a compact solution for a pattern set in which each
pattern has some unique feature. Yet not all pattern sets exhibit this property.
Consider the set having patterns: 111, 011, and 110 (the "exclusive or" problem
again). No one pattern has a unique feature, which means the ADALINE will never
converge to zero error. In these situations a modification of the Widrow-Hoff
procedure is used called the Relaxation Procedure, in which the input values
used to modify the weights are normalized (that is, add up to some constant
number, usually one). This ensures that large patterns do not overly bias
the learning. (For a mathematical description of various other less used learning
procedures see chapter 5 in the text by Duda and Hart.)
Figure 11
Linear Discrimination from Convergent Type Subcircuits
ADALINEs do not need to use
binary input patterns but can also use analog inputs. Then the goal value for each input pattern represents a constant
valued line relative to the inputs, as shown in figure 11. The learning procedure then shifts the
line (yellow) so that it tends to connect the two goal lines. This shows that
weight and summation node convergent subcircuits always produce straight lines;
in other words, they are linear. Any analog pattern must lie on a straight line in
order for it to be completely learned by any ADALINE with zero error convergence.
This is an important limitation of these kinds of circuits. Not realized at
the time was the significance of the error value used in the ADALINEs. Logical
negation of error values produces multivalued and fuzzy logic certainty values,
which allow downstream circuits to work with a "degree of matching" signal.
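The straight-line limitation can be illustrated with a small sketch. It assumes a single analog input plus a constant input line held at 1 (an assumption added here so the fitted line need not pass through the origin), and applies the Widrow-Hoff rule as stated earlier in this section. The sample values and names are hypothetical.

```python
# Illustrative sketch only: samples, the constant line, and names are assumptions.

def train_analog(samples, epochs=2000):
    """Fit weights with Weight Change = input * error / number of inputs."""
    weights = [0.0, 0.0]  # [weight on the analog line, weight on the constant line]
    for _ in range(epochs):
        for x, goal in samples:
            inputs = (x, 1.0)  # assumed constant line supplies an offset
            error = goal - sum(w * v for w, v in zip(weights, inputs))
            weights = [w + v * error / len(inputs) for w, v in zip(weights, inputs)]
    return weights


def worst_error(weights, samples):
    """Largest remaining distance between output and goal over all samples."""
    return max(abs(goal - (weights[0] * x + weights[1])) for x, goal in samples)


on_a_line = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]   # goals follow 2x + 1
off_a_line = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]  # no straight line fits these

line_weights = train_analog(on_a_line)
bent_weights = train_analog(off_a_line)
print("on a line, worst error: ", worst_error(line_weights, on_a_line))
print("off a line, worst error:", worst_error(bent_weights, off_a_line))
```

Goals that fall on a straight line are learned with essentially zero error, while goals that do not fall on a line leave a residual error no matter how long training continues.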
Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis,
John Wiley & Sons, New York
Mattson, R.L. (1959). "The Design and Analysis
of an Adaptive System for Statistical Classification", S.M. Thesis, MIT May 22,
1959
Mattson, R.L. (1959). "A Self-Organizing Logical System", Eastern Joint
Computer Conference Record, I.R.E., N.Y.
Widrow, Bernard, and Hoff, Marcian E.
(1960). Adaptive Switching Circuits, 1960 IRE WESCON Convention Record, New
York: IRE pp. 96-104
Widrow, Bernard (1962). Generalization and Information
Storage in Networks of Adaline Neurons. In M.C. Yovits, G.T. Jacobi, &
G.D. Goldstein (Eds.), Self-Organizing Systems. Washington D.C.: Spartan Books.