Using Automated Within-Subject Invisible Experiments to Test the Effectiveness of Automated Vocabulary Assistance

Abstract. Machine learning offers the potential to allow an intelligent tutoring system to learn effective tutoring strategies. A necessary prerequisite to learning an effective strategy is being able to automatically test a strategy’s effectiveness. We conducted an automated, within-subject “invisible experiment” to test the effectiveness of a particular form of vocabulary instruction in a Reading Tutor that listens. Both conditions were in the context of assisted oral reading with the computer. The control condition was encountering a word in a story. The experimental condition was first reading a short automatically generated “factoid” about the word, such as “Cheetah can be a kind of cat. Is it here?”, and then reading the sentence from the story containing the target word. The initial analysis revealed no significant difference between the conditions. Further inspection revealed that students sometimes benefited from receiving help on “hard” or infrequent words. Designing, implementing, and analyzing this experiment shed light not only on the particular vocabulary help tested, but also on the machine-learning-inspired methodology we used to test the effectiveness of this tutorial action.

1 How can tutors learn?

Good human teachers learn what works best for which students in different contexts. In contrast, automated tutors generally learn little if anything from their interactions with students. This may be one reason why their effectiveness – though sometimes surpassing conventional classroom instruction – still lags behind individual human tutoring. Yet an automated tutor could potentially learn from individual interaction with many more students than a human could tutor in a lifetime.

Learning to tutor better means learning to make better tutorial choices. Automated tutors embody many choices, such as which task to pose next or what help to give. Such choices may be built into the tutor design, or computed at runtime. These choices are presumably crucial to educational effectiveness. For example, a study of one-on-one human tutoring [Juel, 1996] found that successful tutor-student dyads engaged in a significantly different distribution of activities than less successful dyads.

However, individual tutorial choices are difficult to evaluate. First, establishing student improvement may be challenging. Second, attributing a particular improvement to a specific tutorial decision is also difficult. Testing improvement immediately after a tutorial intervention may reduce interactions between interventions, but may also allow recency effects to dominate the experimental outcome. Assigning credit for outcomes to specific tutorial decisions can also be simplified if outcomes can be factored in terms of which tutorial actions are likely to affect them. In particular, in this paper we assume that vocabulary gains are the sum of independent gains on individual words, each affected only by exposure to that word. This assumption is a simplifying approximation, because learning about one word such as tusk may help a student learn about another word such as elephant. Nonetheless, assuming independence of vocabulary gains lets us relate outcomes on individual words to the choices involving the student’s encounters with that word.
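Stated a little more formally (the notation here is introduced only for exposition), the assumption is that a student’s overall vocabulary gain decomposes additively over target words:

\[
\Delta V_s \;=\; \sum_{w \in W_s} g_s(w),
\]

where W_s is the set of target words that student s encountered, and g_s(w) is that student’s gain on word w, assumed to depend only on the student’s exposures to w (including any tutorial action taken at those encounters) and not on the student’s experience with other words.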
Finally, even if a strong model relates instructional interventions to educational outcomes, conventional between-student controlled experiments can compare only a few alternatives out of the astronomical number of possible combinations, given the expense of assigning a large enough sample of students to each condition to obtain statistically informative results. Moreover, such comparisons reveal only which design works better overall for which students. They do not characterize the specific contexts in which each compared behavior works best.

We need more efficient methods for learning to make better tutorial choices. Learning from experience involves trying out different choices and evaluating their effects. Thus a tutor that learns needs to systematically explore different tutorial choices, assess their effects on students’ educational gains, and apportion credit for those effects among the series of choices that led to them. Analogously, a spoken dialog system that learns to improve the quality of its interactions needs to explore alternative responses, assess their effects on customer satisfaction or other outcome measures, and infer when to use each response [Walker et al. 1997, Singh et al. 1999].

We have therefore been exploring a novel methodology for evaluating tutorial methods, made possible by automated tutors. In this paradigm, which we call “invisible experiments,” an automated tutorial agent randomly selects from a set of felicitous (context-appropriate) behaviors, and records the machine-observable effects of each such decision on subsequent dialog. Aggregating over many randomized trials then enables us to evaluate effects of different conversational behaviors on human-tutor dialog. Conventional experiments assess the overall effects of a particular tutor design on tutorial effectiveness. In contrast, invisible experiments offer a controlled assessment of the fine-grained effects of a given tutorial choice in various contexts, compared to what would happen otherwise. Thus they illuminate why and when a choice succeeds.
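To make the paradigm concrete, the following sketch shows the essential loop of one such trial. It is illustrative only; the function names, log format, and outcome coding are assumptions made here for exposition, not the Reading Tutor’s actual implementation.

import csv
import random
from collections import defaultdict

# Illustrative sketch of an invisible experiment; names and formats are assumed.

def run_trial(student_id, context, felicitous_actions, log_path="trials.csv"):
    """Randomly choose one context-appropriate tutorial action and log the trial."""
    action = random.choice(felicitous_actions)    # randomized assignment to condition
    with open(log_path, "a", newline="") as log:
        csv.writer(log).writerow([student_id, context, action])
    return action  # the tutor then carries out the chosen action as usual

def evaluate(trials_with_outcomes):
    """Aggregate machine-observable outcomes (e.g., later test results) by condition."""
    totals = defaultdict(lambda: [0, 0])          # action -> [sum of outcomes, count]
    for action, outcome in trials_with_outcomes:  # outcome: 1 = success, 0 = failure
        totals[action][0] += outcome
        totals[action][1] += 1
    return {action: s / n for action, (s, n) in totals.items()}

Because the choice is randomized independently at each opportunity, aggregating the logged trials compares conditions within the same students and the same tutorial contexts.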
2 Project LISTEN’s Reading Tutor

Project LISTEN’s Reading Tutor listens to children read aloud, and helps them learn to read [Mostow & Aist CALICO 1999]. The Reading Tutor displays one sentence at a time to the student, listens to the student read all or part of the sentence aloud, and responds expressively using recorded human voices. The design of the Reading Tutor addresses the educational goal of learning to read, while balancing motivational factors such as confidence, challenge, curiosity, and control [Lepper et al. 1993]. Therefore, the Reading Tutor lets children choose stories from a variety of genres, including nonfiction, fictional narratives, and poems.

3 Teaching vocabulary during assisted oral reading

Part of helping children learn to read is helping them make the most of encounters with new vocabulary. As children transition from learning to read into reading to learn, they must be able to understand what they read. Having a good vocabulary is essential to reading for understanding. One of the best ways to teach vocabulary is to have the student read. However, encountering a word in a sentence may not be sufficient to ensure that a student learns the meaning of the word. We wanted to explore ways of augmenting text to help children learn words better than they would from the unaugmented text.

4 Experiment: Does automated vocabulary assistance help?

We conducted an experiment to test whether augmenting text with information about words would help children learn the meanings of those words better than they would have from the text alone. In Fall 1999, 60 children in six grade 2-3 classrooms read stories using a version of the Reading Tutor modified to provide extra vocabulary help on some words. We augmented some words in the stories the child was reading with synonyms (X means Y), antonyms (X is the opposite of Y), or hypernyms (X is a kind of Y). For a given child, some of the words were augmented and others were left unaugmented to serve as controls. These synonyms, antonyms, or hypernyms were retrieved from WordNet [http://www.cogsci.princeton.edu/~wn/w3wn.html], a lexical database. The words selected were those with only one or two senses in WordNet. (We selected such words in order to handle any text without first having to sense-tag polysemous words.) The next time the child logged in (typically the next day), the computer presented multiple-choice vocabulary probes.

To summarize briefly the form of the experimental trials:
• Context – a student using the Reading Tutor encounters a new word in a story, such as “assistance” in: “but no one paid any heed to his cries, nor rendered any assistance.”
• Treatment – explain some new words but not others: “Maybe assistance is like aid here. Is it?”
• Test – the next day, automatically generate a multiple-choice question: “Which do you think means the most like assistance: carrying out; help; saving; line?”

In this example, “assistance” was the target word, “aid” was the comparison word, and “help” was the expected answer in the multiple-choice question presented the next day. In some questions, the expected answer was the same as the comparison word presented the day before, and might therefore be easier due to lexical memory effects. We now discuss several aspects of this experiment in more detail.

4.1 Assigning words to conditions for vocabulary assistance

For each student, half of the target words were randomly assigned to an “extra help” condition, and the rest of the target words to a control (no extra help) condition. Assignment of words to conditions was done just prior to displaying the sentence with the target word, while the student was reading the story with help from the Reading Tutor. When the student encountered a previously unseen target word, the new word was randomly assigned to either the experimental (context + factoid) or control (context only) condition for that student. By having an open-ended set of target words instead of a fixed list, we allowed for the addition of new material by teachers, students, or the Project LISTEN team without disrupting the study design. The assignments of words to conditions were intended to persist throughout the student’s history of Reading Tutor use, to enable us to look for longer-term effects of multiple exposures to a word. Unfortunately, due to a flaw in the software, the assignments were not saved to disk. We therefore analyzed only a student’s first day of experience with a word, and the subsequent vocabulary question.

4.2 Constructing and displaying vocabulary assistance

While the student re