John Hale
connectionism, symbolic cognition & the past-tense debate
version of Apr 8, 1999

past-tense

When you want to indicate that something has happened prior to now, you inflect a verb to ``put it in the past tense.'' The exact way to do this is part of your knowledge of morphology.

Lots of verbs are ``regular''
walk -> walked
kick -> kicked
howl -> howled
procrastinate -> procrastinated
...
Others are ``irregular''
be -> was
go -> went
come -> came
strike -> struck
...

Assuming that computation = cognition, what kind of computation do people in fact do to work out what the correct inflection is?

possible solutions

  1. rule for regulars, memory slots for irregulars
  2. PDP neural net for all
  3. rule for regulars, neural net for irregulars

rote memorization

Rote memorization is a symbolic computational theory that describes English past-tense using just 1 rule. It also requires some kind of memory to store irregular forms: this memory is a sequence of slots, like a computer memory.

Knowing regular inflection is knowing the regular inflection rule. By contrast, all irregular verbs are memorized. Sometimes this memory fails and the regular inflectional system fires instead, resulting in an ``overregularization error'', e.g. breaked. On this view, only the regular system is productive: only the regular system can spontaneously apply to novel words.
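
The rule-plus-slots picture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule is written over spelling rather than sounds, the slot contents are a handful of example verbs, and recall_prob is a made-up parameter standing in for "sometimes this memory fails."

```python
import random

# Hypothetical rote memory: one "slot" per memorized irregular verb.
IRREGULAR_SLOTS = {"be": "was", "go": "went", "come": "came",
                   "strike": "struck", "break": "broke"}

def regular_rule(stem):
    """The single regular rule, spelled orthographically: add -d after e, else -ed."""
    return stem + "d" if stem.endswith("e") else stem + "ed"

def past_tense(stem, recall_prob=1.0, rng=random):
    """Try rote memory first; if the stem is absent or recall fails, apply the rule.

    recall_prob is an assumed parameter (not from the theory itself) giving
    the chance that a stored irregular form is successfully retrieved.
    """
    if stem in IRREGULAR_SLOTS and rng.random() < recall_prob:
        return IRREGULAR_SLOTS[stem]
    return regular_rule(stem)
```

With perfect recall, past_tense("go") returns went. With recall_prob=0.0 the memory always fails and past_tense("break") returns breaked. A stem with no slot, such as spling, can only ever go through the rule.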

On this theory, we should expect all unknown words to be inflected by the regular system. But in fact this doesn't happen. When presented with nonsense words like spling, lots of people inflect it as splung (Pinker and others).

Inflection immediately starts to follow the rule-like pattern once the learner hits upon the correct rule. This corresponds to an observed developmental stage around age 3 at which all verbs are treated as regular, e.g. goed. The overall pattern is called the U-shaped learning curve because performance has to go down before it can go back up again.

connectionist analogizing

Rumelhart & McClelland 1986 present a parallel distributed processing model that realizes the intuition ``kids inflect new verbs like the ones they already know.'' Their central goal was to accurately model the U-shaped curve without resorting to explicit computational rules.

neural network
connected group of mathematically idealized numerical processing units that can be simulated by computer
connectionism
cognitive science that models the mind with neural networks

parallel distributed processing
particular variety of connectionism

quick review of PDP models:

A set of processing units
A state of activation
An output function for each unit
A pattern of connectivity among units
A propagation rule for propagating patterns of activities through the network of connectivities
An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce a new level of activation for the unit
A learning rule whereby patterns of connectivity are modified by experience
An environment within which the system must operate

(Rumelhart, Hinton & McClelland p46)

Rumelhart and McClelland used the presence (1) or absence (0) of phonetic features of sounds to define the state of activation in their neural net. They translated each word into a 460-digit string of 1s and 0s. The patterns for the stems were associated with the correctly inflected past-tense forms using a learning method called the ``perceptron learning rule.''

perceptron
one of the original neural nets promoted by Frank Rosenblatt in 1958, later attacked by Marvin Minsky and Seymour Papert in 1969.

In the perceptron learning rule, connectivity weights are adjusted by a teacher if the network gives the wrong answer. The weights on the lines coming into the errant node are increased (output value was too low) or decreased (output value was too high), depending on how the actual response was wrong.
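
The update just described can be sketched as follows. This is a minimal illustration, assuming deterministic threshold units, a bias term per output unit, and toy four-bit patterns invented for the example; none of these details are taken from the RM model itself.

```python
def predict(weights, bias, x):
    """Each output unit fires (1) when its weighted input plus bias exceeds zero."""
    return [1 if sum(w * xi for w, xi in zip(row, x)) + b > 0 else 0
            for row, b in zip(weights, bias)]

def train(pairs, n_in, n_out, rate=1.0, max_epochs=100):
    """Perceptron rule: after a wrong answer, nudge the errant unit's weights."""
    weights = [[0.0] * n_in for _ in range(n_out)]
    bias = [0.0] * n_out
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in pairs:
            out = predict(weights, bias, x)
            for j in range(n_out):
                err = target[j] - out[j]  # +1: too low, -1: too high, 0: correct
                if err:
                    mistakes += 1
                    bias[j] += rate * err
                    for i in range(n_in):
                        # Only lines carrying an active input (x[i] == 1) change.
                        weights[j][i] += rate * err * x[i]
        if mistakes == 0:  # every pattern answered correctly on this pass
            break
    return weights, bias

# Toy input/output patterns (made up; not real phonetic features).
pairs = [([1, 0, 0, 1], [1, 0]),
         ([0, 1, 0, 1], [0, 1]),
         ([0, 0, 1, 1], [1, 1])]
weights, bias = train(pairs, n_in=4, n_out=2)
```

After training, predict(weights, bias, x) reproduces the target for each of the three toy patterns. The teacher never tells a unit what its weights should be, only whether its output was too high or too low.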

To simulate the earliest phase of past-tense learning, the model was first trained on the 10 high-frequency verbs, receiving 10 cycles of training presentations through the set of 10 verbs. This was enough to produce quite good performance on these verbs. We take the performance of the model at this point to correspond to the performance of a child in Phase 1 of acquisition. To simulate later phases of learning, the 410 medium-frequency verbs were added to the first 10 verbs, and the system was given 190 more learning trials, with each trial consisting of one presentation of each of the 420 verbs. The responses of the model early on in this phase of training correspond to Phase 2 of the acquisition process; its ultimate performance at the end of 190 exposures to each of the 420 verbs corresponds to Phase 3.

(Rumelhart & McClelland, p241)

If the data are presented in this order, the perceptron learning rule finds a function from vectors to vectors that can be interpreted as inflecting the English past tense pretty well.
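
The presentation order in the quoted passage can be written out as a schedule. The sketch below uses placeholder verb names, since the actual verb lists are not given here; only the counts (10, 410, 10 cycles, 190 trials) come from the quotation.

```python
def training_schedule(high_freq, medium_freq, phase1_cycles=10, later_trials=190):
    """Yield, pass by pass, the list of verbs shown to the model."""
    # Phase 1: only the high-frequency verbs.
    for _ in range(phase1_cycles):
        yield list(high_freq)
    # Later phases: the medium-frequency verbs are added all at once.
    combined = list(high_freq) + list(medium_freq)
    for _ in range(later_trials):
        yield list(combined)

# Placeholder names stand in for the real verb lists.
high_freq = [f"high_{i}" for i in range(10)]
medium_freq = [f"medium_{i}" for i in range(410)]

passes = list(training_schedule(high_freq, medium_freq))
presentations = sum(len(p) for p in passes)
```

This gives 200 passes and 10*10 + 190*420 = 79900 verb presentations in total. The training vocabulary jumps from 10 verbs on pass 10 to 420 verbs on pass 11.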

symbolism strikes back

Pinker & Prince 88 demonstrate that the Rumelhart & McClelland 86 past-tense network has many shortcomings, including these:

it cannot represent certain words
not all of the 460-digit representations uniquely pick out words
it cannot learn many rules
consider high-sticked/*high-stuck V->N->V
it can learn rules found in no human language
since the perceptron learning rule could just as easily have learned ``reverse the letters,'' the model says little about Language
it cannot explain morphological and phonological regularities
the RM model's phonetic features are at a level too low to handle more abstract similarities
it fails at its assigned task of mastering the English PT
20 out of 72 = 28%
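
The first shortcoming, that a code need not pick out a unique word, can be illustrated with a simplified stand-in for the RM encoding. The sketch assumes a word is represented by the unordered set of its context-sensitive letter triples, with # marking word edges. This is a letter-based simplification of the ``Wickelphone'' idea behind the RM representation, not the actual 460-unit feature code. The pair algal/algalgal is the kind of case Pinker & Prince discuss.

```python
def triples(word):
    """Return the set of three-letter windows over the word, with # at both edges."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

short_code = triples("algal")
long_code = triples("algalgal")

# Because the code is a set, repeated triples leave no trace.
same_code = short_code == long_code
```

Here same_code is True: both strings reduce to the five triples #al, alg, lga, gal, al#, so a network that sees only this code cannot tell the two words apart.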

Data from an intensive study of children's language acquisition (Brown, 1973) indicate that children's vocabulary does not undergo changes as radical as the changes to the input mix in the RM86 model.

conclusion: RM86 is a single route model which cannot accurately characterize the facts about the acquisition of the English past-tense.

a dual route

In a series of articles (Pinker 1991; Prasada & Pinker 1993), Steven Pinker describes a ``dual route'' model which is intended to capture the best parts of the symbolist and connectionist accounts of the past tense.

Regular verbs are computed by a suffixation rule, just as in the rule-and-rote theory. Irregular verbs (only) are retrieved from an associative memory.

associative memory
a memory that retrieves data using incomplete fragments of the desired item as a clue. Neural networks have been characterized as associative memories, although other implementations (e.g. searching) are possible
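
A searching implementation of an associative memory might look like the sketch below. It is an illustration only: the similarity measure (number of shared letter pairs), the stored verbs, and the function names are all assumptions made for the example, not part of the dual-route proposal.

```python
def bigrams(s):
    """Return the set of adjacent letter pairs in s, with # marking the edges."""
    padded = "#" + s + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def recall(memory, clue):
    """Search every stored stem and return the one sharing most bigrams with the clue."""
    best = max(memory, key=lambda stem: len(bigrams(stem) & bigrams(clue)))
    return best, memory[best]

# Hypothetical memory contents: irregular stems and their past-tense forms.
memory = {"strike": "struck", "come": "came", "go": "went", "sling": "slung"}
```

The incomplete clue strik retrieves the strike/struck entry. A clue need not be a fragment of a stored item: recall(memory, "spling") returns the sling/slung entry, because sling is the most similar stem in this small memory.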

Subjects inflect nonsense words that sound like irregulars in a way similar to irregulars that they know, rather than always overregularizing.

A continuous effect of similarity has been measured experimentally: subjects frequently (44%) convert spling to splung (based on string, sling, et cetera), less often (24%) convert shink to shunk, and rarely (7%) convert sid to sud.

(Pinker 1991, p532)

On this view, the irregular system, but not the regular system, should show frequency effects:

idioms
irregulars, but not regulars, sound weird in idioms: forgo/forwent
rating
niceness rating uncorrelated with niceness of stem in irregulars, but correlated in regulars
reaction time
frequency effect for irregulars but not regulars
priming
priming by regular PT is just like priming by stem, unlike priming by irregular PT

neuropsychological evidence

If there do exist two different inflectional routes, one for regulars and one for irregulars, then we should find people who are impaired on one but not the other.

SJD is a 47-year-old college educated female who suffered a thromboembolic left-hemisphere stroke in June 1984...

SJD's speech is characterised by fluent, usually complete sentences with occasionally morphological and function word errors, semantic paraphasias, phonemic paraphasias, and hesitations for word-retrieval.

... the contrast between her performance on the regularly inflected words and that for the irregularly inflected verbs indicates that it is the morpho-phonemic structure of the inflected words, and not their morpho-semantic complexity, that is relevant to this effect.

(Badecker and Caramazza, 1991 p341)

Patients like SJD are taken as evidence supporting a dual route model, because it looks like the two routes can be neurologically damaged independently of one another. The distributed nature of the Rumelhart & McClelland 86 model suggests that any damage (snipping of connections?) would globally make things worse, for all verbs regular or irregular. In a PDP model, no single connection holds all of the network's knowledge about a particular kind of input.

what's a route?

The dual-route versus single-route issue is about the character of the human computation of inflection. Multi-route proponents must prove that each route implements a qualitatively different kind of function. Single-route proponents must show that the architecture of their one mechanism is sufficient to realize all of the different kinds of inflection that we see, while not being so powerful that it fails to say what is so human about the computation.

Philosophy of science question: all of the exceptional cases that the ``default'' route can't handle (i.e. irregular verbs) are lumped into the ``nondefault'' route. The nondefault route must be pretty powerful since it is doing all the hard work. How is this an explanation?


Cleaned up by John Hale