Incremental Referential Domain Circumscription during Processing of Natural and Synthesized Speech

Mary D. Swift (mswift@ling.rochester.edu)
Department of Linguistics, University of Rochester
Rochester, NY 14627

Ellen Campana (ecampana@bcs.rochester.edu)
Department of Brain and Cognitive Sciences, University of Rochester
Rochester, NY 14627

James F. Allen (james@cs.rochester.edu)
Department of Computer Sciences, University of Rochester
Rochester, NY 14627

Michael K. Tanenhaus (mtan@bcs.rochester.edu)
Department of Brain and Cognitive Sciences, University of Rochester
Rochester, NY 14627

Abstract

We present experimental evidence from a study in which we monitor eye movements as people respond to pre-recorded instructions generated by a human speaker and by two text-to-speech synthesizers. We replicate findings demonstrating that people process spoken language incrementally, making partial commitments as the instruction unfolds. Specifically, they establish different referential domains on the fly depending on whether a definite or indefinite article is used. Importantly, incremental understanding is observed for both natural speech instructions and synthesized text-to-speech instructions. These results, including some suggestive differences in responses with the two text-to-speech systems, establish the potential for using eye-tracking as a new method for fine-grained evaluation of dialogue systems and for using dialogue systems as a theoretical and experimental tool for psycholinguistic experimentation.

Background

Rapid increases in the accuracy and speed of automatic speech recognition and the increased availability of off-the-shelf text-to-speech systems have fueled great interest in spoken dialogue systems (e.g., Allen, Byron, Dzikovska, Ferguson, Galescu & Stent, 2001; Zue, Seneff, Glass, Polifroni, Pao, Hazen & Hetherington, 2000).
As the sophistication of such systems increases, we can expect applications to more open-ended domains with larger vocabularies and more varied utterance types. The feasibility of such systems raises both applied and theoretical issues for work on natural language processing that crosses disciplinary boundaries. We focus on two issues here. The first, a computational issue, addresses the need for developing better evaluation tools for dialogue systems, especially tools that can evaluate comprehension on an utterance-by-utterance and within-utterance basis. The second, a psycholinguistic issue, is the possibility that in the near future implemented dialogue systems could serve as a powerful tool for developing and testing psycholinguistic models by allowing stimuli to be generated ‘on the fly,’ conditioned on the current state of the discourse.

A necessary prerequisite for enabling both of these goals is that people respond to synthesized speech in much the same way as they do to natural speech. We present experimental evidence from a study in which we monitor eye movements as people respond to pre-recorded instructions generated by a human speaker and by two text-to-speech synthesizers. We replicate findings demonstrating that people process spoken language incrementally, making partial commitments as the instruction unfolds. More specifically, listeners establish referential domains on the fly depending on whether a definite or indefinite article is used.

Eye movements as an evaluation tool

Spoken utterances unfold over time, resulting in a stream of temporary ambiguities. For example, as the instruction Click on the beaker unfolds, the word beaker is briefly consistent with multiple candidates, including beetle, beeper, and speaker.
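
The temporary ambiguity in the beaker example can be illustrated with a minimal sketch that narrows a candidate set as each sound arrives. Everything in it is an assumption made for illustration: the four-word lexicon, the hand-written ARPAbet-like transcriptions, and the function names `onset_candidates` and `unfold` are hypothetical and are not part of the study's materials or analysis. The sketch matches on word onsets only, so it does not capture speaker, which overlaps with beaker at the end of the word rather than the beginning.

```python
# Hypothetical, hand-written phoneme transcriptions (ARPAbet-like).
# These are illustrative assumptions, not data from the study.
LEXICON = {
    "beaker": ["b", "iy", "k", "er"],
    "beetle": ["b", "iy", "t", "ah", "l"],
    "beeper": ["b", "iy", "p", "er"],
    "speaker": ["s", "p", "iy", "k", "er"],
}


def onset_candidates(heard, lexicon=LEXICON):
    """Return, in sorted order, the words whose transcription begins
    with the phonemes heard so far."""
    heard = list(heard)
    n = len(heard)
    return sorted(w for w, phones in lexicon.items() if phones[:n] == heard)


def unfold(word, lexicon=LEXICON):
    """Yield (prefix, candidates) after each successive phoneme of `word`."""
    phones = lexicon[word]
    for i in range(1, len(phones) + 1):
        prefix = phones[:i]
        yield prefix, onset_candidates(prefix, lexicon)


if __name__ == "__main__":
    # Print how the candidate set shrinks as "beaker" unfolds.
    for prefix, candidates in unfold("beaker"):
        print(" ".join(prefix), "->", candidates)
```

In this toy lexicon, beetle and beeper remain candidates until the third sound of beaker is heard, which is the kind of within-word time course that a fine-grained measure such as eye movements is meant to detect.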
Numerous psycholinguistic studies demonstrate that people comprehend utterances continuously, entertaining multiple lexical candidates (e.g., Marslen-Wilson, 1987), making provisional commitments at points of syntactic ambiguity, and resolving reference incrementally (e.g., Altmann, 1998; Tanenhaus & Trueswell, 1995). Recent studies using eye movements to a task-relevant object in a visual workspace as people follow spoken instructions provide striking evidence for both incremental understanding and rapid integration of multiple constraints (Tanenhaus, Spivey-Knowlton, Eberhard & Sedivy, 1995; 1996; Tanenhaus, Magnuson & Chambers, forthcoming). For example, if the instruction Click on a beaker is presented in a context in which there are two icons of beakers and two icons of beetles, then reference will be delayed until the word beaker is disambiguated phonetically