Learning Involves Attention

Attention in learning

One of the primary factors in the resurgence of connectionist modeling is these models’ ability to learn input-output mappings. Simply by being presented with examples of inputs and the corresponding outputs, the models can learn to reproduce the examples and to generalize in interesting ways. After the limitations of perceptron learning (Minsky & Papert, 1969; Rosenblatt, 1958) were overcome, most notably by the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986) but also by other ingenious learning methods (e.g., Ackley, Hinton, & Sejnowski, 1985; Hopfield, 1982), connectionist learning models exploded into popularity. Connectionist models provide a rich language in which to express theories of associative learning. Architectures and learning rules abound, all waiting to be explored and tested for their ability to account for learning by humans or other animals.

A thesis of this chapter is that connectionist learning models must incorporate rapidly shifting selective attention and the ability to learn attentional redistributions. This kind of attentional shifting is not only necessary to mimic learning by humans and other animals; it is also a highly effective and rational solution to the demands of learning many new associations as quickly as possible. This chapter describes three experiments (one previously published and two new) that demonstrate the action of attentional learning. All the results are fit by connectionist models that shift and learn attention, but the results cannot be fit when the attention mechanisms are shut off.

Shifts of attention facilitate learning

A basic fact of learning is that people quickly learn new associations without rapidly forgetting old associations. Presumably this ability is highly adaptive for any creature that confronts a rich and complex environment.
Consider a hypothetical situation in which an animal learns that mushrooms with a round top and smooth texture are tasty and nutritious. After successfully using this knowledge for some time, the animal encounters a new mushroom with a smooth texture but a flat top. This mushroom turns out to induce nausea. How is the animal to quickly learn about this new kind of mushroom without destroying still-useful knowledge about the old kind of mushroom? If the animal learns to associate both features of the new mushroom with nausea, then it will inappropriately destroy part of its previous knowledge about healthy mushrooms. That is, the previous association from smooth texture to edibility will be destroyed. On the other hand, if the old association is retained, it generates a conflicting response (i.e., eating the mushroom). To facilitate learning about the new case, it would be advantageous to selectively attend to the distinctive feature, namely, flat top, and learn to associate this feature with nausea. Selectively attending to the distinctive feature preserves previous knowledge and facilitates new learning.

Not only should attention be shifted in this way to facilitate learning, but the shifted attentional distribution should itself be learned: Whenever the animal encounters a mushroom with smooth texture and flat top, it should shift attention to the flat top, away from the smooth texture. This will allow the animal to properly anticipate nausea and to avoid the mushroom.¹ The third example in this chapter describes a situation in which people use exactly this kind of attentional shifting during learning. The challenge to the theorist is expressing these intuitions about attention in a fully specified model.

¹ An alternative possible solution would be to encode the entire configuration of features in each type of mushroom, and to disallow any generalization on the basis of partially matched configurations. In this way, knowledge about smooth round mushrooms would not interfere with knowledge about smooth flat mushrooms, despite the fact that both pieces of knowledge include the feature smoothness. A problem with this approach is that knowledge does not generalize from learned cases to novel cases, yet generalization is perhaps the most fundamental goal of learning in the first place. For a discussion of configural and elemental learning, see the chapter by Shanks in this collection.

Shifts of attention can be assessed by subsequent learning

The term “attention,” as used here, refers to both the influence of a feature on an immediate response and the influence of a feature on learning. If a feature is being strongly attended to, then that feature should have a strong influence on the immediate response and on the imminent learning. This latter influence of attention on learning is sometimes referred to as the feature’s associability. In this chapter, these two influences of attention are treated as equivalent. This treatment is a natural consequence of the connectionist models described below, but it might ultimately turn out to be inappropriate in the face of future data.

Because redistribution of attention is a learned response to stimuli, the degree of attentional learning can be assayed by examining subsequent learning ability. If a person has learned that a particular feature is highly indicative of an appropriate response, then, presumably, the person has also learned to attend to that feature. If subsequent training makes a different feature relevant to new responses, then learning about this new correspondence should be relatively slow, because the person will have to unlearn the attention given to the now-irrelevant feature. In general, learned attention to features or dimensions can be inferred from the ease with which subsequent associations are learned. This technique is used in all three examples presented below.

Intra- and extra-dimensional shifts

A traditional learning paradigm in psychology investigates perseveration of learned attention across phases of training. In the first phase, participants learn that one stimulus dimension is relevant to the outcome while other dimensions are irrelevant. In the second phase, the mapping of stimuli to outcomes changes so that either a different dimension is relevant or the same dimension remains relevant. The former change of relevance is called extradimensional shift, and the latter change is called intradimensional shift. Many studies in many species have shown that intradimensional shift is easier than extradimensional shift, a fact that can be explained by the hypothesis that subjects learn to attend to the relevant dimension, and this attentional shift perseverates into the second phase (e.g., Mackintosh, 1965; Wolff, 1967). In this section of the chapter, a recent experiment demonstrating this difference is summarized, and a connectionist model that incorporates attentional learning is shown to fit the data, whereas the model cannot fit the data if its attentional learning mechanism is “turned off.”

Experiment design and results

Consider the simple line drawings of freight train box cars shown in Figure 1. They vary on three binary dimensions: height, door position, and wheel color.

Figure 1. Stimuli used for relevance shift experiment of Kruschke (1996b). (The ovals merely demarcate the different stimuli and are not part of the stimuli per se. The lines connecting the ovals indicate the dimensions of variation between stimuli.)

In an experiment conducted in my lab (Kruschke, 1996b), people learned to classify these cars into one of two routes.
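The stimulus set described above can be enumerated directly: three binary dimensions yield eight box cars, one per corner of a cube, and two cars are joined by an edge when they differ on exactly one dimension. The following is a minimal sketch for illustration only. The dimension names come from the text, but the 0/1 coding of each dimension's two values and the function names are assumptions, not part of the original experiment description.

```python
from itertools import product

# Dimension names are taken from the text. Coding each dimension's two
# values as 0 and 1 is an arbitrary assumption made for illustration.
DIMENSIONS = ("height", "door_position", "wheel_color")

def enumerate_stimuli(dimensions=DIMENSIONS):
    """Return every combination of binary values as a dict (one per cube corner)."""
    return [dict(zip(dimensions, values))
            for values in product((0, 1), repeat=len(dimensions))]

def differing_dimensions(a, b):
    """List the dimensions on which two stimuli take different values."""
    return [name for name in a if a[name] != b[name]]

def cube_edges(stimuli):
    """Pairs of stimuli that differ on exactly one dimension (the cube's edges)."""
    edges = []
    for i, a in enumerate(stimuli):
        for b in stimuli[i + 1:]:
            diff = differing_dimensions(a, b)
            if len(diff) == 1:
                edges.append((a, b, diff[0]))
    return edges

if __name__ == "__main__":
    cars = enumerate_stimuli()
    edges = cube_edges(cars)
    print(len(cars), "stimuli,", len(edges), "edges")  # 8 stimuli, 12 edges
```

Each edge is labeled with the single dimension along which its two cars vary, which corresponds to the connecting lines mentioned in the Figure 1 caption.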
On each trial in a series, a car would appear on a computer screen, the learner would make his or her choice of the route of the car by pressing a corresponding key, and then the correct route would be displayed. During the first few trials, the learner could only guess, but after many trials, she or he could learn the correct answers.

Figure 2 indicates the mapping of cars to routes. The cubes in Figure 2 correspond with the cube shown in Figure 1. Each corner is marked with a disk whose color indicates the route taken by the corresponding train; in other words, the color of the disk indicates the category of the stimulus. The left side of Figure 2 shows the categorization learned in the first phase of training, and the right side shows the categorization learned subsequently. In the first phase, it can be seen that the vertical dimension is irrelevant. This means that variation on the vertical dimension produces no variation in categorization: The vertical dimension can be ignored with no loss in categorization accuracy. The other two dimensions, however, are relevant in the first phase. Some readers might recognize this as the exclusive-or (XOR) structure on the two relevant dimensions.

In the subsequent phase, some learners experienced a change to the top-right structure of Figure 2, and other learners experienced a change to the bottom-right structure. In both of these second-phase structures only one dimension is relevant, but in the top shift this relevant dimension was one of the initially relevant dimensions, so the shift of relevance is called intradimensional, whereas in the bottom shift the relevant dimension was the initially irrelevant one, so the shift of relevance is called extradimensional.
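The category structures described above can be made concrete with a short sketch. It is an illustration under stated assumptions, not the procedure or model of Kruschke (1996b): stimuli are coded as triples of 0/1 values, index 0 is arbitrarily taken to be the initially irrelevant "vertical" dimension, and all function names are hypothetical. A dimension is treated as relevant if changing it alone changes the category of at least one stimulus.

```python
from itertools import product

# Each stimulus is a tuple of three binary values, one per dimension.
# Assumption for illustration: index 0 is the "vertical" dimension that is
# irrelevant in the first phase; indices 1 and 2 are the relevant ones.
STIMULI = list(product((0, 1), repeat=3))

def phase1_xor(stim):
    """First phase: the category is the XOR of dimensions 1 and 2."""
    return stim[1] ^ stim[2]

def single_dimension(index):
    """Build a second-phase structure in which only `index` determines the category."""
    return lambda stim: stim[index]

def relevant_dimensions(category):
    """Dimensions whose variation changes the category for at least one stimulus."""
    relevant = set()
    for stim in STIMULI:
        for d in range(3):
            flipped = tuple(v ^ 1 if i == d else v for i, v in enumerate(stim))
            if category(stim) != category(flipped):
                relevant.add(d)
    return relevant

def shift_type(first, second):
    """Label a shift by whether the newly relevant dimensions were relevant before.

    This simple subset test is only meant for the case in the text, where
    the second phase has a single relevant dimension.
    """
    before, after = relevant_dimensions(first), relevant_dimensions(second)
    return "intradimensional" if after <= before else "extradimensional"

if __name__ == "__main__":
    print(sorted(relevant_dimensions(phase1_xor)))      # [1, 2]
    print(shift_type(phase1_xor, single_dimension(1)))  # intradimensional
    print(shift_type(phase1_xor, single_dimension(0)))  # extradimensional
```

Checking relevance by flipping one dimension at a time mirrors the statement that variation on an irrelevant dimension produces no variation in categorization.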

[1] L. Allan, et al. The widespread influence of the Rescorla-Wagner model, 1996, Psychonomic Bulletin & Review.

[2] S. Kosslyn, et al. From learning processes to cognitive processes, 1992.

[3] G. Bower, et al. From conditioning to category learning: an adaptive network model, 1988.

[4] J. Kruschke, et al. ALCOVE: an exemplar-based connectionist model of category learning, 1992, Psychological Review.

[5] N. Mackintosh. A Theory of Attention: Variations in the Associability of Stimuli with Reinforcement, 1975.

[6] D. Medin, et al. Problem structure and the use of base-rate information from experience, 1988, Journal of Experimental Psychology: General.

[7] John K. Kruschke, et al. Dimensional Relevance Shifts in Category Learning, 1996, Connect. Sci.

[8] J. L. Wolff, et al. Concept-shift and discrimination-reversal learning in humans, 1967, Psychological Bulletin.

[9] K. Haberlandt, et al. Stimulus selection in animal discrimination learning, 1968, Journal of Experimental Psychology.

[10] David R. Shanks. Connectionist Accounts of the Inverse Base-rate Effect in Categorization, 1992.

[11] N. Schmajuk, et al. Stimulus configuration, classical conditioning, and hippocampal function, 1992, Psychological Review.

[12] R. Rescorla. A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement, 1972.

[13] J. Kruschke. Base rates in category learning, 1996, Journal of Experimental Psychology: Learning, Memory, and Cognition.

[14] Geoffrey E. Hinton, et al. A Learning Algorithm for Boltzmann Machines, 1985, Cogn. Sci.

[15] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain, 1958, Psychological Review.

[16] John R. Anderson, et al. The Adaptive Character of Thought, 1990.

[17] J. Pearce. Similarity and discrimination: a selective review and a connectionist model, 1994, Psychological Review.

[18] N. Castellan. Multiple-cue probability learning with irrelevant cues, 1973.

[19] N. Mackintosh, et al. Mechanisms of animal discrimination learning, 1971.

[20] W. R. Garner. The Processing of Information and Structure, 1974.

[21] R. Shepard. Attention and the metric structure of the stimulus space, 1964.

[22] N. J. Slamecka. A methodological analysis of shift paradigms in human discrimination learning, 1968, Psychological Bulletin.

[23] M. Gluck, et al. Psychobiological models of hippocampal function in learning and memory, 1997, Annual Review of Psychology.

[24] L. Kamin. Attention-like processes in classical conditioning, 1967.

[25] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.

[26] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities, 1982.

[27] J. Kruschke, et al. A model of probabilistic category learning, 1999, Journal of Experimental Psychology: Learning, Memory, and Cognition.

[28] R. Shepard, et al. Toward a universal law of generalization for psychological science, 1987, Science.

[29] J. Kruschke. Toward a unified model of attention in associative learning, 2001.

[30] Ralph R. Miller, et al. Assessment of the Rescorla-Wagner model, 1995.

[31] John K. Kruschke, et al. Shifting attention in cued recall, 1998.

[32] R. Nosofsky. Exemplars, prototypes, and similarity rules, 1992.

[33] N. Mackintosh, et al. Blocking as a Function of Novelty of CS and Predictability of UCS, 1971, The Quarterly Journal of Experimental Psychology.

[34] Nathaniel J. Blair, et al. Blocking and backward blocking involve learned inattention, 2000, Psychonomic Bulletin & Review.

[35] Robert A. Jacobs, et al. Increased rates of convergence through learning rate adaptation, 1987, Neural Networks.

[36] B. Williams. Associative competition in operant conditioning: blocking the response-reinforcer association, 1999, Psychonomic Bulletin & Review.

[37] N. Mackintosh. Selective attention in animal discrimination learning, 1965, Psychological Bulletin.

[38] John K. Kruschke, et al. Associative learning in baboons (Papio papio) and humans (Homo sapiens): species differences in learned attention to visual features, 1998, Animal Cognition.