Inducing hybrid models of task learning from visualmotor data

Devika Subramanian (devika@cs.rice.edu)
Department of Computer Science, Rice University, 6100 Main St MS 132, Houston TX 77005

Abstract

We develop a new hybrid model of human learning on the NRL Navigation Task (Gordon et al., 1994). Unlike our previous efforts (Gordon & Subramanian, 1997), in which our model was crafted from verbal protocols and eyetracker data, we demonstrate the feasibility of using visualmotor data (time series of sensor-action pairs) gathered during training to construct models of a subject's strategy. The goal of our cognitive modeling is to provide a sufficiently detailed description of the subject's strategic misconceptions in real time, in order to tailor a personalized task training protocol. Using a small-parameter hybrid model that can be estimated directly and efficiently from the visualmotor data, we study the deviation of the subject's action choices from those dictated by a near-optimal policy for the task. This model gives us a clear description of the subject's current strategy relative to the near-optimal policy, thus directly suggesting performance hints to the subject. We also provide evidence that our model parameters are sufficient to account for individual differences in learning performance.

Introduction

Our goal is to build computational models of humans learning to perform complex visualmotor tasks. By a model of human learning, we mean an explicit representation of the human's action policies (mappings from perceptual inputs to motor actions) and their evolution over time. The models will be used in designing personalized training protocols to help humans achieve high levels of competence on these tasks. This intended use places constraints on the class of models we can consider and the methods for evaluating them. In particular, the models need to be detailed enough to pinpoint problems in a subject's learning, yet coarse enough to be unambiguously built from the available visualmotor learning data. Our criterion for evaluating models is empirical: (i) they must accurately identify incorrect aspects of the subject's strategy, and (ii) when used in place of the human, they must yield comparable performance.

A major challenge in this endeavour is the fact that the visualmotor data are at an extremely low level. One approach to modeling in such a situation is to start with a cognitive architecture, and then to find parameter settings for that architecture which recreate the available low-level data. This tactic is adopted by Newell in UTC, by Anderson in ACT*, and by Kieras and Meyer in EPIC. We take an alternative approach here based on behavioral cloning (Sammut et al., 1998). In our approach, the low-level visualmotor data is taken as the ground truth, and using ideas from machine learning and data mining we "compress" the data into a policy which maps sensors to actions. If there are high-level regularities at the policy level in the learning data, they will be reliably extracted by our learning algorithms. This approach has the advantage that cognitive modeling constructs arise endogenously from the data, rather than being stipulated a priori.
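The sketch below illustrates the behavioral-cloning step in its simplest form: the subject's recorded sensor-action pairs are treated as supervised training data, and a learner "compresses" them into an explicit policy from perceptual inputs to motor actions. The decision-tree learner, the synthetic sensor log, the ten sensor channels, and the five discrete action codes are all illustrative assumptions for this sketch, not the estimator or encoding actually used in our study.

```python
# A minimal behavioral-cloning sketch (illustrative assumptions only):
# fit a supervised learner on logged (sensor, action) pairs so that the
# result is an explicit policy mapping sensors to actions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical log of one training session: each row is the sensor panel
# at one time step, paired with the subject's discrete joystick action.
rng = np.random.default_rng(0)
sensors = rng.random((500, 10))          # 500 time steps x 10 sensor channels
actions = rng.integers(0, 5, size=500)   # 5 discrete (turn, speed) codes

# "Compress" the visualmotor data into a policy.
policy = DecisionTreeClassifier(max_depth=4).fit(sensors, actions)

# The induced policy can be run in place of the subject, or compared
# state-by-state against a near-optimal policy to flag strategic errors.
predicted = policy.predict(sensors)
agreement = (predicted == actions).mean()
print(f"policy reproduces {agreement:.0%} of the subject's recorded actions")
```

Any learner that yields an inspectable sensor-to-action mapping could play the same role; the point is only that the policy is induced from the data rather than stipulated by a cognitive architecture.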
Our task domain is the NRL Navigation task (Gordon et al., 1994), developed by Alan Schultz at the Naval Research Laboratory (NRL). It requires piloting an underwater vehicle through a field of mines, guided by a small suite of sonar, range, bearing and fuel sensors. Sensor information is presented via an instrument panel that is updated in real time. The sensors are noisy. Decisions about motion of the vehicle (speed and turn) are communicated via a joystick interface. The task objective is to rendezvous with a stationary target before exhausting fuel and without hitting the mines. The mines may be stationary or drifting. A trial or episode begins with the vehicle being randomly placed on one side of a mine field and ends with one of three possible outcomes: the vehicle reaches the target, hits a mine, or exhausts its fuel. Reinforcement, in the form of a scalar reward dependent on the outcome, is received at the end of each episode. Since the mine configurations vary from episode to episode, it is fruitless for subjects to memorize a