Simple Brain Simulation Resources 

DanceFloor by dynamic artist Jenny James. Copyright 2006 (used with permission)

History and Principles of Neural Networks to 1960

by David D. Olmsted (Copyright - 1998, 1999, 2006. Free to use for personal and educational purposes)
Last revised August 26, 2006

Introduction

This introduction presents the underlying principles of neural networks so one can understand the strengths and weaknesses of each type and see how they relate to each other. Consequently, this introduction is generally non-mathematical with an emphasis on network diagrams. What follows are the significant neural network designs seeking to mimic how memory and pattern recognition might work in the brain. Other neural network designs not covered here include those for classical conditioning, motor control in the cerebellum, and rhythmic motor generation.

While I always recommend reading the original journal articles so one can read the footnotes, the following book has most (but not all) of the neural network articles referenced in this paper:

Neurocomputing Foundations of Research (1988) Edited by James A. Anderson and Edward Rosenfeld. MIT Press, Cambridge, Massachusetts

The First Neural Network Theory

(This section is mostly taken from Wilkes and Wade (1997). Thanks to Prof. Nicholas Wade at the University of Dundee in Scotland for acquainting me with Alexander Bain)

The first neural network was presented by Alexander Bain (1818 - 1903) of the United Kingdom in his 1873 book entitled "Mind and Body. The Theories of Their Relation". His work was no doubt inspired by the then new findings of neuroanatomists who, using the newly discovered neural stains of carmine (by Gerlach in 1858), methylene blue (by Nissl in 1858), and hematoxylin (by Waldeyer in 1863 - stains axon fibers), were able for the first time to see the true extent of the interlocking fibers of the brain. The state of knowledge before these stains is given by the 1853 Manual of Human Histology written by Rudolf Kolliker, which states that there were various forms of nerve cells and

"besides these there are a good many fine pale fibers, like the processes of cells, only more extended of which nothing more can be said as to whether they are nerve tubes or are to be referred to as the processes of cells." (McHenry - 1969)

Bain first described memory as a set of nerve currents weaker than that produced by the original stimulus.

"If we suppose the sound of a bell striking the ear, and then ceasing, there is a certain continuing impression of a feebler kind, the idea or memory of the note of the bell; and it would take some very good reason to deter us from the obvious inference that the continuing impression is the persisting (although reduced) nerve currents from the past - the remembrance of the former sound of the bell" (1873, p 90 as reprinted in Wilkes and Wade - 1997).

The ability to recall a specific memory requires that an association (grouping) first be made with some other memory, sensation, or motor action via some kind of neural growth.

"for every act of memory, every exercise of bodily aptitude, every habit recollection, train of ideas, there is a specific grouping, or coordination, of sensations and movements, by virtue of specific growths in cell junctions" (1873, p. 91 as reprinted in Wilkes and Wade - 1997)

Bain next applied these general ideas in a more concrete fashion, suggesting an early form of threshold decision making as shown in figure 1, where each line represents an idea which sums (co-operates) with other ideas:

Figure 1
Bain's Summation Threshold Network
"If each set of sensory fibres had one definite connection with motory or outcarrying fibres, we should have always the same movement answering to the stimulation of the same nerves, as in the reflex system; the fibre a could do nothing but affect the movement x. It is necessary to the variety and flexibility of our acquirements, that the fibre a should at one time take part in stimulating x, and at another time take part in stimulating y, the circumstances being different. The stroke of the clock will stimulate us at one time to set in one direction, and at another time in another direction, according to the ideas that it co-operates with."

Yet Bain realized that if his theory were true, every possible association or grouping would have to be hardwired into the brain:

"...the number of fibres and cells brought into action, before the grouping can converge upon some one set of cells definitely connected with an out-going motor arrangement, or with some other internal grouping, - must be very great indeed; and but for the vast number of fibres and cells, demonstrably present in the brain, the separate embodiment of every separate impression and idea would seem impractical." (1873, p. 113, as reprinted in Wilkes and Wade - 1997)

In an attempt to reduce the degree of hardwiring needed, Bain suggested that a signal attenuation strategy through the brain network would add greater flexibility. In figure 2 each node has some resistance, so a weaker signal is exhausted after a shorter path while a stronger signal reaches farther.
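
A minimal Python sketch of this attenuation idea follows. It is an illustration only, not Bain's own formulation: the unit resistance per crossing, the chain of six crossings, and the positions at which movements x and y are wired (the names X_AT and Y_AT) are all assumptions chosen for the example.

```python
def cells_reached(strength, resistance_per_crossing, n_crossings):
    """Count the crossings a current passes before it is exhausted."""
    reached = 0
    for _ in range(n_crossings):
        strength -= resistance_per_crossing  # each crossing costs some current
        if strength <= 0:
            break  # the current stops short of this crossing
        reached += 1
    return reached


# Hypothetical wiring: movement x hangs off crossing 2, movement y off crossing 5.
X_AT, Y_AT = 2, 5


def movements_triggered(strength):
    """List the movements whose crossing the current is strong enough to reach."""
    reach = cells_reached(strength, resistance_per_crossing=1.0, n_crossings=6)
    return [name for name, position in (("x", X_AT), ("y", Y_AT)) if reach >= position]
```

With these assumed values a current of strength 3 is exhausted after two crossings and triggers only x, while a current of strength 6 also reaches the crossing for y. In this sketch the stronger current still passes the crossing wired to x on its way.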

Figure 2
Bain's Signal Attenuation Proposal
"... a more energetic current necessarily takes a more extended sweep, and affects a number of cells and fibres that are left quiescent under a feebler current. The cells being viewed as crossings - where a current in one circuit induces a current in an adjoining circuit - there is, at each crossing, a certain resistance to overcome, and the feebler current is sooner exhausted and stops short of the distance reached by the stronger." (1873, pp. 113-114, as reprinted in Wilkes and Wade - 1997)
" ... the degree of stimulation of the same fibres will determine not merely a greater energy of the same response, as would happen in reflex stimulation, but a totally different response: a, weak, determines movement x; a, strong, determines y." (1873, p.109, as reprinted in Wilkes and Wade - 1997).

Notice that Bain does not explain how the stronger signal avoids simultaneously activating the weak signal action. Bain's adaptive rule for associations (groupings) precedes that of later, better known authors such as Hebb:

"We know what are the conditions of making an acquirement, or of fixing two or more things together in memory. The separate impression must be made together, or flow in close succession; and they must be held together for a certain length of time, either on one occasion or on repeated occasions. Now to each impression, each sensation or thought, there corresponds physically a group or series of nerve currents; when two impressions concur, or closely succeed one another, the nerve currents find some bridge or place of continuity, better or worse, according to the abundance of nerve matter available for the transition. In the cells or corpuscles where the currents meet and join, there is, in consequence of the meeting, a strengthened connexion or diminished obstruction - a preference track for that line over lines where no continuity has been established." (1873, p. 117, as reprinted in Wilkes and Wade - 1997)

Most of the ideas for Bain's book Mind and Body were prepared 10 years earlier for three lectures to the Philosophical Society in Aberdeen, Scotland. Yet the idea that most of the brain had to be hardwired apparently did not appeal to most scientists. Even his student David Ferrier, who published his own classic work The Functions of the Brain (1876), never mentioned the details of Bain's work, although the work itself is mentioned. In the greatest irony of all, even Bain came to doubt his own ideas:

"The hypothesis was a legitimate one; but subsequent reflection led to the belief that the number of psychical elements, although run up to hundreds of thousands, was still inadequate." (1904, p 313).

Not until 1943, with McCulloch and Pitts at the dawn of the computer age, were brain theorists of equivalent caliber to be found who went beyond language-based general descriptions to actual network drawings.

Bain, A. (1873). Mind and body. The theories of their relation. London: Henry King.
Bain, A. (1904). Autobiography. London: Longmans, Green
Ferrier, D. (1876). The Functions of the Brain. London: Smith, Elder and Co.
McHenry, Lawrence, C. (1969). Garrison's History of Neurology. Springfield, IL: Charles C. Thomas
Wilkes, Alan L. and Wade, Nicholas, J. (1997). Bain on Neural Networks. Brain and Cognition 33:295-305

Restatement of Bain's Adaptive Rule by William James (1890)

In 1890 American psychologist William James restated Bain's adaptive rule without apparently knowing of Bain. William James proposed that all thoughts and bodily actions were produced as a result of neural currents flowing from regions (brain-processes) having an excess of electrical charge to regions having a deficit of electrical charge. The intensity of the thoughts and actions were proportional to the current flow rate which in turn was proportional to the difference in charge between the two regions. Later these reverberating electrical currents would be called engrams. Learning thus consisted of changing these current paths or forming new paths by using the following rule:

"When two elementary brain-processes have been active together or in immediate succession, one of them, on reoccurring, tends to propagate its excitement into the other." (James 1890, p566)

Such learning improves with repetition:

"... nerve currents propagate themselves easiest through those tracts of conduction which have been already most in use." (James 1890, p563)

This also implies that paths not used will tend to close producing forgetting.

Actual characterization of these nerve currents as a burst of action potentials traveling in one direction down a single-celled neuron was only established between 1890 and 1910. This quickly led to the standard neuron model in which a neuron sums all its facilitatory and inhibitory inputs from other connecting neurons and, if the neural charge exceeds a threshold, produces an action potential. The greater the intensity of a sensory stimulus, the greater the frequency of the action potentials, with a typical maximum frequency of about 100 pulses per second.

According to the theory of William James, a neuron's transmission efficiency between stimulus and its output should increase with use as the skill is learned, but the first experimental test found just the opposite. In 1898 England's Charles Sherrington found that spinal neurons in cats reduce their efficiency with low intensity use instead of increasing their efficiency as expected. He had found a phenomenon known today as habituation.

James, William (1890), Principles of Psychology, Henry Holt, New York
Sherrington, C.S. (1898) Experiments in Examination of the Peripheral Distribution of the Fibers of the Posterior Roots of some Spinal Nerves, Proc. of the Royal Society of London, 190:45-186

First Neural Logic (1938 to 1943)

Figure 3
Rashevsky's EXCLUSIVE OR

Yet no theory arose to challenge the holistic, reverberatory scheme of William James until 1938 when N. Rashevsky proposed that the brain could be organized around binary logic operations since action potentials could be viewed as binary 1 (true) values. He even presented the circuit shown in figure 3 showing how a binary logic EXCLUSIVE OR operation could be implemented using addition and subtraction operations. He did not formulate the other logical operations in analog terms so his formulation remained incomplete.
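
Figure 3 is not reproduced here, so the following Python sketch is only one plausible reading of such a circuit and not necessarily Rashevsky's exact wiring. It assumes two intermediate lines, each carrying one input minus the other, that a line cannot carry negative activity (a negative difference is treated as silence), and a final summation of the two lines. The function names are hypothetical.

```python
def line_activity(excitation, inhibition):
    """Activity on a line: excitation minus inhibition, never below zero."""
    return max(0, excitation - inhibition)


def exclusive_or(a, b):
    """EXCLUSIVE OR of two binary inputs built from subtraction and addition."""
    only_a = line_activity(a, b)  # active only when a fires and b is silent
    only_b = line_activity(b, a)  # active only when b fires and a is silent
    return only_a + only_b        # final summation node
```

When both inputs fire, each subtraction cancels the other input and the sum is 0; when exactly one fires, one intermediate line carries a 1 through to the output.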

In 1943 Warren McCulloch and Walter Pitts realized that the natural consequence of the standard neuron model's threshold in combination with binary action potentials produced another type of logic called threshold logic. Since each action potential pulse is an all or nothing binary event a threshold value of 2 defines an AND operation as shown in figure 4. Likewise, a threshold value of 1 defines an INCLUSIVE OR operation since only one action potential on either line is sufficient to produce an output. They also described the use of the subtraction operation in their paper but never explicitly associated it with the logical CONDITIONAL operation.

Figure 4
Threshold Logic AND
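
The two threshold settings described above can be sketched in Python as follows. This is an illustrative reading rather than the notation of McCulloch and Pitts; it assumes the unit fires when the sum of its binary inputs reaches the threshold, and the function names are made up for the example.

```python
def threshold_unit(inputs, threshold):
    """Fire (1) when the summed binary inputs reach the threshold, else stay silent (0)."""
    return 1 if sum(inputs) >= threshold else 0


def logical_and(a, b):
    """Threshold of 2: both lines must carry a pulse."""
    return threshold_unit([a, b], threshold=2)


def inclusive_or(a, b):
    """Threshold of 1: a pulse on either line is sufficient."""
    return threshold_unit([a, b], threshold=1)
```

The same summation node serves both operations; only the threshold value changes.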

Threshold logic was expanded by adding more input lines so that the output of the summation node would become analog since this seemed to fit in more with the standard neuron model. Yet by using more than two inputs the operations ceased to be logical, that is, they ceased to define the basic set of state symmetry testers used as building blocks in larger circuits. Instead these operations became parameter classifiers which classified their input patterns based upon the one parameter of signal magnitude given by the summation of the binary inputs.

The very existence of an analog value for the first time allowed these operations to become adaptive using multiplication factors called weights. Exactly how that adaptability was implemented defines the many different neural networks described below. Yet modifications to a single parameter can never fully characterize the state (the pattern or distribution) of a system's input. Some ambiguity is always the result. Yet at the time this ambiguity was hailed as a great achievement of neural networks, for many saw it as meaning that neural networks could work with partial knowledge, even though this ambiguity could never be precisely defined and controlled.

So the achievement of adaptability in the neural networks of this time came at the cost of losing the decision making resolution of logic. By 1963 this split brain researchers into the separate groupings of neural network researchers and artificial intelligence researchers. The 1963 paper by R.O. Winder seems to have been the last major paper which still attempted to combine the weight based and mathematical neural network approach with the logical artificial intelligence approach, but its approach was to use mathematical techniques to find the desired weights of a network instead of having the network learn them. The result was that the neural network researchers continued to develop parameter based classifier circuits while the artificial intelligence researchers expanded and abstracted the binary logic approach to form propositional and predicate logics which eventually resulted in the development of the programming language called LISP. Because of the precision of logic the field of artificial intelligence enjoyed great initial success. Yet its high level algorithmic and sequential approach eventually proved to have serious limitations in terms of adaptability and the handling of uncertainty.

 McCulloch, W.S. & Pitts, W.H. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics, 5:115-137

Rashevsky, N (1938). Mathematical Biophysics, University of Chicago Press, Chicago, IL

Winder, R.O. (1964). Threshold Logic in Artificial Intelligence, Artificial Intelligence, IEEE Publication S-142, pages 107-128.

(Rashevsky, McCulloch, and Pitts also played a role in the development of systems science. See chapter 2 of the Macroscope by Joel de Rosnay)

The First Randomly Connected Reverberatory Networks (1954)

Neural network research really only became possible at the dawn of the computer age when ideas could be validated by simulation on various types of electronic calculators. The impetus for these simulations was provided by Donald Hebb of McGill University in Canada, who in 1949 proposed this unidirectional variation of the Bain / James learning rule:

"Let us assume then that the persistence or repetition of a reverberatory activity (or trace) tends to induce lasting cellular changes that add to its stability. The assumption can be precisely stated as follows: When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place on one or both cells so that A's efficiency as one of the cells firing B is increased" (Hebb 1949, p62)

Figure 5
The Farley and Clark Network

The first such Hebbian inspired network was simulated by Farley and Clark in 1954 on an early digital computer at M.I.T. Their network shown in figure 5 consisted of nodes representing neurons randomly connected with each other by unidirectional lines having multiplication factors (weights). Each node summed its inputs and produced an output of one if the sum of its inputs exceeded its threshold value. In order to recognize patterns the network was divided into quarters with the inputs from one pattern entering the front top quarter while the inputs from the other pattern entered the front bottom quarter. Only one pattern was presented at a time so each tended to produce its own unique flow path through the network. A pattern was considered to be recognized if it consistently produced more activity in one of the back network quarters than in the other.

Yet in order to get this network to work Farley and Clark had to modify Hebb's learning rule. This new rule required that the activity level of each network line be examined at each instant in time so that if the line value changed in the desired direction then all of its nodal input weights would be incremented (increased). If the change was in the wrong direction then the nodal input weights would be decremented (decreased). With this new directional rule the network was able to successfully discriminate between two widely differing patterns as long as they were presented alternately (see Farley, 1960). The major problem with this type of network is its lack of pattern discrimination resolution combined with the high number of neurons needed to make that discrimination. Also the cells in the network were not self-assembled as was thought to be necessary at the time but were assigned to assemblies (the quadrants) by the experimenters.

Hebb, D.O. (1949). The Organization of Behavior, John Wiley & Sons, New York

Farley, B., and Clark, W.A. (1954). "Simulation of Self-Organizing Systems by Digital Computer", IRE Transactions on Information Theory 4:76-84

Farley, B.G. (1960) Self-Organizing Models for Learned Perception; in M.C. Yovits and S. Cameron (editors), Self-Organizing Systems, Pergamon Press, Oxford

The First Reverberatory Network Showing Self-Assembly (1956)

Figure 6
One Cell of the Self-Assembly Network

The next step was taken in 1956 by N. Rochester and colleagues using an early IBM Type 701 (2 Kbytes memory) and 704 Electronic Calculator (as computers were often called then) located at the IBM research labs in Poughkeepsie, New York. Instead of connecting the cells randomly overall as did Farley and Clark, they organized their cells into a single layer and then randomly connected the cells' outputs back to other cells' inputs (figure 6). In accordance with Hebb's learning rule the weights were increased with use, and they had this to say about it:

"No process of just this sort has been observed in living tissue. However, it has not been possible to demonstrate, by measurement, that the Hebb postulate is false. Nothing else has been observed that could account for learning and memory in a plausible way."

Despite this they were forced to modify Hebb's rule by normalizing all the weight values so that they always added up to some constant value. The reason was that with use all the weights would eventually increase to their maximum value. What was desired was that the more heavily used weights be larger than the lesser used weights, so if any weight is increased the other weights must be decreased by an equivalent overall amount. (Significantly, they did not discuss how this process could be accomplished neurally.)
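
A rough Python sketch of a use-dependent weight increase followed by normalization is given below. It is not the procedure from the Rochester paper, whose details are not given here: the fixed increment `rate`, the proportional rescaling of all weights, and the function name are assumptions made for illustration.

```python
def hebbian_step(weights, active, rate=0.1, total=1.0):
    """Grow the weights on active lines, then rescale so the weights sum to `total`."""
    # Hebb-style growth: only lines that took part in firing are strengthened.
    grown = [w + rate if is_active else w for w, is_active in zip(weights, active)]
    # Normalization: shrink every weight by the same factor to restore the constant sum.
    scale = total / sum(grown)
    return [w * scale for w in grown]


weights = [0.25, 0.25, 0.25, 0.25]
for _ in range(20):
    # Only the first line is used; without normalization it would simply saturate.
    weights = hebbian_step(weights, active=[True, False, False, False])
```

After repeated use of the first line its weight dominates while the unused weights shrink, yet the total stays constant, which is the behavior the normalization was meant to produce.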

They also tried neural habituation in their network, which occurred when a rapidly firing neuron increased its threshold so as to become less responsive. The result was a network which did not form any assemblies but which did show a continuing wave-like reverberation after the inputs ceased.

Meanwhile, Peter Milner, an associate of Donald Hebb, suggested that cell assemblies could form if most of the synapses within a cell assembly were excitatory (positive) while those between cell assemblies were inhibitory (subtractive). So they changed the network weight range from 0 to 256 to a range of -1 to 1. In order to speed up the simulation they ceased using binary pulses on the network lines and instead had the lines represent analog firing frequencies, which in their model ranged from 0 to 15. They called this the F.M. model. The result was that cell assemblies built up around the input regions but only around those regions. They did not appear anywhere else in the network. In response to this partial success they said the following:

"This kind of investigation cannot prove how the brain works. It can, however, show that some models are unworkable and provide clues as to how to revise the models to make them work."

Rochester, N., Holland, J.H., Haibt, L.H. and Duda, W.L. (1956). Tests on a Cell Assembly Theory of the Action of the Brain Using a Large Digital Computer, IRE Transactions on Information Theory IT-2:80-93

The First Pattern Regularity Detector - The Original Perceptron (1958)

Random neural networks were not having much success yet many researchers were stuck on the idea that the neural connections in the brain were mostly random. So a slight compromise was made in 1958 by Frank Rosenblatt who initiated a new phase in neural network research by abandoning the idea of self-forming cell assemblies. Instead memory was simply a change in the relation between some input and some output which changed with regular use. This meant that those patterns which regularly and consistently occurred would be learned. This type of network for unsupervised learning is called an auto-learning network or a pattern regularity detector. Consequently, the randomness of neural connections was not overall but only occurred between differing layers of neural cells. On page 388 of his 1958 paper he states the number one assumption of the perceptron's design as:

"The physical connections of the nervous system which are involved in learning and recognition are not identical from one organism to another. At birth, the construction of the most important networks is largely random, subject to a minimum number of genetic constraints."

The Original Perceptron was presented to answer the last two of these three fundamental questions about the brain (page 386):

  1. How is information about the physical world sensed, or detected, by the biological system?
  2. In what form is information stored, or remembered?
  3. How does information contained in storage, or in memory, influence recognition and behavior?

The simplicity and random connections of the Original Perceptron and the later Perceptrons made them a fascinating subject for mathematical analysis using probability theory. Rosenblatt's preference for this approach as opposed to the developing algorithmic approach of artificial intelligence is shown by this passage in his 1960 paper (page 301):

" Simulation should not, in general, be attempted without a theoretical analysis of the nerve net in question, sufficient to indicate questions of theoretical interest. The examination of arbitrary networks in the hope that they will yield something interesting, or the simulation of networks which have been specially designed to compute a particular function by a definite algorithmic procedure seem to be about equally lacking in value."
Figure 7
The Original Perceptron - Positive Inputs

The original Perceptron circuit shown in figure 7 consisted of many convergent type subcircuits feeding into a decision making element which only passed its greatest valued input (gate comparator). A convergent subcircuit is characterized by several input lines feeding into some central operation or series of operations. In this case the central operation is a summation node connected to a threshold unit which in turn connects to a multiplication factor (weight). Rosenblatt called these subcircuits "A" units for association units. Each received a certain number of binary (0 or 1) positive inputs and a lesser number of negative (subtractive) inputs connected in a random manner. If the sum of these inputs exceeded the threshold value then the threshold unit would produce an output value of 1. This value was then modified by the weight, valued between 0 and 1, which Rosenblatt called the value of the "A" unit.

The output of each convergent subcircuit was next sent to the gate comparator, which selected its largest input value, passed it on, and fed it back to its convergent subcircuit's ("A" unit's) weight to increment it by a certain amount. In this way the most used template would increase its transmission efficiency in conformity to the ideas of Donald Hebb mentioned above. Any neural network using a gate comparator is known as a competitive network because the different convergent subcircuits compete to get their signal selected.
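
The competitive scheme just described can be sketched in Python roughly as follows. This is a simplified illustration and not Rosenblatt's implementation: the dictionary layout of a unit, the increment `step`, the cap of 1.0 on weights, and the choice to fire when the sum reaches the threshold are all assumptions, and the gate comparator is written simply as a maximum function.

```python
def a_unit_output(pattern, unit):
    """Weighted output of one "A" unit for a binary input pattern."""
    total = (sum(pattern[i] for i in unit["positive"])
             - sum(pattern[i] for i in unit["negative"]))
    fired = 1 if total >= unit["threshold"] else 0
    return fired * unit["weight"]


def present(pattern, units, step=0.1):
    """Select the unit with the largest output and reinforce its weight."""
    outputs = [a_unit_output(pattern, unit) for unit in units]
    winner = max(range(len(units)), key=lambda i: outputs[i])  # gate comparator
    if outputs[winner] == 0:
        return None  # no unit exceeded its threshold, so nothing is reinforced
    units[winner]["weight"] = min(1.0, units[winner]["weight"] + step)
    return winner


units = [
    {"positive": [0, 1], "negative": [], "threshold": 2, "weight": 0.5},
    {"positive": [2, 3], "negative": [], "threshold": 2, "weight": 0.5},
]
winner = present([1, 1, 0, 0], units)  # only the first unit's inputs are active
```

Each presentation strengthens only the winning unit, so a pattern that recurs regularly gradually gains an advantage in later competitions.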

The irony here is that a gate comparator seems to be impossible to form beyond the two input, two output level using only the mathematical operations of addition and subtraction (in such a case the circuit is equivalent to the EXCLUSIVE OR circuit of figure 3 without the final summation node). In all the competitive types of neural networks this function is either left vague or is simply stated as a mathematical function. In Rosenblatt's paper the gate comparator function is simply discussed at the two input, two output level. This should have been a clue that something more than addition and subtraction operations was required in neural networks (gate comparators of any size are easy to construct using multivalued logic operations).

Because the inputs to the convergent subcircuits are randomly connected every subcircuit is already somewhat pre-tuned to respond to some pattern. The only purpose of the multiplication factors (weights) is to ensure the correct convergent subcircuit selection by the gate comparator in situations where the positive inputs from two or more patterns overlap as shown in figure 7. In such a case both patterns will exceed the threshold value leaving weight value to make the final discrimination.

Figure 8
The Original Perceptron - Positive and Negative Inputs for Pattern Subset Discrimination

The major test of any neural network's ability to classify patterns is the subset test in which the neural network must discriminate a pattern which is smaller and completely contained within another larger pattern. A variation of this test is the so called "exclusive or" test in which two different smaller patterns are contained within the larger pattern and all three must be discriminated. The Original Perceptron is not always able to pass the subset test if it only uses positive inputs (figure 7) yet it can if negative inputs are used (figure 8). The ability of the all positive input Original Perceptrons to discriminate between a full pattern and its subset (partial) pattern depends on the order of pattern presentation and on luck. If the partial match pattern is presented first then both the top and bottom convergent subcircuits of figure 7 would respond equally given equal initial weights. Yet in true marketing fashion, this lack of discrimination ability was touted as the significant brain-like characteristic of generalization in which the pattern does not have to be an exact match in order for it to be classified! Still, the lack of control over this process is a weakness of the original positive input only perceptron.

This subset pattern discrimination weakness in the all positive input Original Perceptron can be remedied to a certain degree if the total multiplication factor values (weights) of the network remain constant (if they add up to one this requirement is called normalization). Thus whenever a weight is incremented a certain amount all the other weights together must be decremented (reduced) the same overall amount. This biases the response towards the more recently presented and larger patterns. Thus in figure 7 if the full pattern for the top subcircuit has been presented most recently then the presentation of the partial pattern will also activate the top subcircuit. Instead, if the full pattern for the middle subcircuit has been presented most recently then the presentation of the partial pattern will activate the middle subcircuit. While this is an interesting phenomenon, weight normalization does not seem to be something that real neuronal circuits could accomplish.

The use of negative inputs as shown in figure 8 partly overcomes the subset pattern discrimination problem by having larger patterns inhibit the smaller subset patterns. A template is a spatial filter, passing only that pattern (or image) which exactly matches its positive inputs. If any part of that pattern exceeds (overlaps) a template's border then negative inputs will inhibit the activation of the convergent subcircuit. A partial template leaves part of the possible pattern border undefined (no inhibition). Another way to characterize templates with border inhibition is by hardness and softness. A hard template will completely inhibit the convergent subcircuit if any part of the pattern overlaps the negative inputs. A soft template requires a large degree of overlap onto some inhibitory region in order to suppress the convergent subcircuit. The network template matching function has an algorithmic equivalent called convolution. In the Original Perceptron the negative inputs were just put in at random in the hope that they might accomplish the partial template function for some presented set of patterns.
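
The effect of border inhibition on the subset test can be sketched in Python as follows. The four-element patterns, the index lists, and the function name are assumptions chosen for illustration; in the Original Perceptron itself the negative inputs were placed at random rather than designed as they are here.

```python
def subcircuit_fires(pattern, positive, negative=(), threshold=1):
    """True when excitation minus inhibition reaches the threshold."""
    excitation = sum(pattern[i] for i in positive)
    inhibition = sum(pattern[i] for i in negative)
    return excitation - inhibition >= threshold


FULL = [1, 1, 1, 1]  # larger pattern
PART = [1, 1, 0, 0]  # subset pattern, completely contained in FULL

# Positive inputs only: the subset template also responds to the full pattern.
ambiguous = subcircuit_fires(FULL, positive=[0, 1], threshold=2)

# Negative inputs on the border: the full pattern now inhibits the subset template.
inhibited = not subcircuit_fires(FULL, positive=[0, 1], negative=[2, 3], threshold=2)
```

With the negative inputs in place the subset template responds to PART alone, while a template with all four positive inputs and a threshold of 4 responds only to FULL.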

In the Original Perceptron overlapping patterns could also have been discriminated by adaptively changing the values of the thresholds but this does not seem to have been investigated. For example, a threshold value of 3 can easily discriminate between a pattern having 3 positive binary inputs and one having 2 positive binary inputs.

Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review, 65:386-408

Rosenblatt, F (1960). Perceptron Simulation Experiments, Proceedings of the IRE, 48:301-309

Neural Pattern Classification Arrives - ADALINE (1960)

ADALINE is an acronym for ADAptive LINear element. For the first time a convergent type subcircuit having weights before the summation node is used to formally classify patterns. Adaline was developed by Bernard Widrow and Marcian Hoff for the purpose of recognizing binary patterns so that it could predict the next bit in a flowing stream of bits. As such it was based on the ideas of R.L. Mattson, then a student at MIT. Consequently, the paper Widrow and Hoff published is an engineering paper concerned with computing which makes no mention of the brain or brain memory. Perhaps because of this Adalines do not use a threshold.

Figure 9
Adaline - No Convergence of Weights to a Stable Value

Adalines are based on the use of an attractive (as opposed to repulsive) goal seeking learning procedure in which a convergent subcircuit output value is defined as the goal for each pattern. Consequently, the subcircuit must learn the proper weight values to produce that goal value for any given set of input patterns. This is in contrast to the later perceptrons which were repulsive driven meaning that a specified goal value is not defined for each subcircuit, only misclassification errors. Even less complete error information leads to reinforcement learning which only provides the information that an error or success has occurred without the magnitude or direction (type) of that error. Positive reinforcement is the attractive form while negative reinforcement is the repulsive form.

Figure 9 shows an adaline using a learning procedure which does not work while figure 10 shows a procedure which does work. The patterns chosen for this example again demonstrate the subset problem in which one pattern is completely contained within another pattern. This type of discrimination is the most difficult for convergent type sub-circuits to discriminate.

Learning begins in figure 9 by first choosing a goal value which the pattern presented to the ADALINE is supposed to match. In Widrow and Hoff's paper these were 1, 0, or -1. Pattern One is supposed to produce a 1. It is presented at time = 1 and the error between the goal value and the actual ADALINE value is noted. The error value is then divided among the weights, which eliminates that error. Next Pattern Two, which is supposed to produce a 0, is presented and the weights are adjusted again to eliminate the error. Next Pattern One is presented again at time = 3 and it produces an error of 0.2, which will change the weights back to what they were originally, with the result that the network is not learning the patterns. Yet this method will work sometimes, and it is the procedure used in Widrow and Hoff's first paper.

A more successful procedure was found by Widrow in 1962; it is called the Widrow-Hoff learning rule or the Delta learning rule. It is based on the realization that the greatest sources of the error are the active lines. Consequently, the Widrow-Hoff learning rule changes the value of each weight in proportion to its pre-weight line value (in this case 1 or 0) according to the following rule:

Weight Change = (Pre-Weight Line Value) * (Error / (Number of Inputs))

Figure 10
Adaline - With Convergence

While this is the official Delta Rule, notice that it does not conserve the error value as Widrow and Hoff's first paper did. With the error value per weight divided by the number of inputs, not all of the error may be used (although repeated presentations will gradually reduce this error). Yet if the error is conserved as is done in figure 10, so that all of it is distributed to the weights, then the error is totally eliminated (see time = 2 of figure 10). Using error conservation increases the rate of learning.
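The difference between the two ways of dividing the error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-line patterns, and the zero starting weights are hypothetical and are not taken from figures 9 and 10, and "conserving the error" is read here as dividing the error by the number of active lines instead of by the number of inputs.

```python
def adaline_output(weights, pattern):
    """Weighted sum of the input lines, with no threshold."""
    return sum(w * x for w, x in zip(weights, pattern))


def delta_update(weights, pattern, goal, conserve_error=False):
    """Return the weights after one presentation of a binary pattern.

    conserve_error=False: Weight Change = x * (error / number of inputs).
    conserve_error=True:  the error is divided among the active lines only
                          (an assumed reading of "error conservation").
    """
    error = goal - adaline_output(weights, pattern)
    divisor = sum(pattern) if conserve_error else len(pattern)
    if divisor == 0:  # no active line, so nothing can be adjusted
        return list(weights)
    return [w + x * error / divisor for w, x in zip(weights, pattern)]


def train(patterns_and_goals, cycles, conserve_error):
    """Present every pattern in turn, `cycles` times, starting from zero weights."""
    weights = [0.0] * len(patterns_and_goals[0][0])
    for _ in range(cycles):
        for pattern, goal in patterns_and_goals:
            weights = delta_update(weights, pattern, goal, conserve_error)
    return weights


# Hypothetical subset problem: pattern 110 is contained in pattern 111.
SUBSET_SET = [([1, 1, 0], 1.0), ([1, 1, 1], 0.0)]

start = [0.0, 0.0, 0.0]
one_step_official = delta_update(start, [1, 1, 0], 1.0)
one_step_conserved = delta_update(start, [1, 1, 0], 1.0, conserve_error=True)

print("remaining error, official rule: ",
      1.0 - adaline_output(one_step_official, [1, 1, 0]))
print("remaining error, conserved error:",
      1.0 - adaline_output(one_step_conserved, [1, 1, 0]))
print("weights after 100 cycles (conserved):", train(SUBSET_SET, 100, True))
```

With the pattern 110 and a goal of 1, the official rule leaves one third of the error after the first presentation, while the conserving variant removes all of it. In this small example repeated presentations also reduce the error under the official rule, only more slowly.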

Instead of using just the pre-weight line value to indicate the greatest source of error, a variation would be to include the weight values as well, since the post-weight line value gives a more accurate measure of error assignment. The trade-off is a slightly more complex learning rule and, if the error value is not conserved, slower learning, since smaller numbers are the result. In fact, this variation is the rule used in the Back Propagation networks discussed below. (This is part of a larger problem in neural network research known as the credit assignment problem. As networks get more complex, determining the source of any resultant error also becomes more complex.)

What happens with the Widrow-Hoff procedure is that all those input lines representing features which are common to the patterns will tend to zero. This leaves the non-overlapping pattern features to define the convergent subcircuit output value. Consequently, only one ADALINE is needed for any set of patterns, making it a compact solution for a pattern set in which each pattern has some unique feature. Yet not all pattern sets exhibit this property. Consider the set having patterns 111, 011, and 110 (the "exclusive or" problem again). No one pattern has a unique feature, which means the ADALINE will never converge to zero error. In these situations a modification of the Widrow-Hoff procedure called the Relaxation Procedure is used, in which the input values used to modify the weights are normalized (that is, they add up to some constant number, usually one). This ensures that large patterns do not overly bias the learning. (For a mathematical description of various other less used learning procedures, see chapter 5 in the text by Duda and Hart.)

Figure 11
Linear Discrimination from Convergent Type Subcircuits

ADALINEs do not need to use binary input patterns but can also use analog inputs. Then the goal value for each input pattern represents a constant-valued line relative to the inputs, as shown in figure 11. The learning procedure then shifts the line (yellow) so that it tends to connect the two goal lines. This shows that weight and summation node convergent subcircuits always produce straight lines; in other words, they are linear. Any set of analog patterns must lie on a straight line in order to be completely learned by an ADALINE with zero error convergence. This is an important limitation of these kinds of circuits. Not realized at the time was the significance of the error value used in the ADALINEs. Logical negation of error values produces multivalued and fuzzy logic certainty values, which allow downstream circuits to work with a "degree of matching" signal.
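The linearity limitation can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration and is not taken from Widrow and Hoff's paper or from figure 11: a single analog input, an extra constant line acting as a bias weight, a fixed learning rate, and two made-up data sets (one lying on a straight line, one lying on a curve).

```python
def train_adaline(samples, rate=0.1, epochs=2000):
    """Fit one weight and one bias to (input, goal) pairs.

    Each presentation changes the weights in proportion to the error and to
    the value on the line feeding the weight. The bias is treated as a weight
    on a line that always carries 1 (an assumption of this sketch).
    """
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, goal in samples:
            error = goal - (weight * x + bias)
            weight += rate * error * x
            bias += rate * error
    return weight, bias


def max_abs_error(samples, weight, bias):
    """Largest remaining error over all samples."""
    return max(abs(goal - (weight * x + bias)) for x, goal in samples)


inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
on_a_line = [(x, 2.0 * x - 1.0) for x in inputs]   # goals lie on a straight line
on_a_curve = [(x, x * x) for x in inputs]          # goals lie on a parabola

line_weight, line_bias = train_adaline(on_a_line)
curve_weight, curve_bias = train_adaline(on_a_curve)

print("worst error, straight-line goals:",
      max_abs_error(on_a_line, line_weight, line_bias))
print("worst error, curved goals:       ",
      max_abs_error(on_a_curve, curve_weight, curve_bias))
```

The goals that lie on a straight line are learned with an error that shrinks toward zero, while the curved goals always leave a residual error, because no single straight line passes through all of them.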

Duda, R.O., and Hart, P.E. (1973). Pattern Classification and Scene Analysis, John Wiley & Sons, New York

Mattson, R.L. (1959). "The Design and Analysis of an Adaptive System for Statistical Classification", S.M. Thesis, MIT, May 22, 1959

Mattson, R.L. (1959). "A Self-Organizing Logical System", Eastern Joint Computer Conference Record, I.R.E., N.Y.

Widrow, Bernard, and Hoff, Marcian E. (1960). Adaptive Switching Circuits, 1960 IRE WESCON Convention Record, New York: IRE, pp. 96-104

Widrow, Bernard (1962). Generalization and Information Storage in Networks of Adaline Neurons. In M.C. Yovits, G.T. Jacobi, & G.D. Goldstein (Eds.), Self-Organizing Systems. Washington D.C.: Spartan Books.


Web site by David D. Olmsted. He can be contacted at brainsim1-contact at yahoo dot com (this is an anti-spam tactic. Type the address as normal). Original site established August 21, 1998 by David D. Olmsted. New home page published August 25, 2006

Information compiled by David D. Olmsted © 1998 to 2006 (Free to use for personal and educational use)