On Understanding Discourse in Human-Computer Interaction

Paul P. Maglio (pmaglio@almaden.ibm.com)
Teenie Matlock (tmatlock@psych.stanford.edu)
Sydney J. Gould (sydneygould@hotmail.com)
Dave Koons (dkoons@almaden.ibm.com)
Christopher S. Campbell (ccampbel@almaden.ibm.com)
IBM Almaden Research Center, 650 Harry Rd, B2-NWE, San Jose, CA 95120 USA

Abstract

We report on an experiment that investigated how people naturally communicate with computational devices using speech and gaze. Our approach follows from the idea that human-human conversation involves the establishment of common ground, the use of gaze direction to indicate attention and turn-taking, and awareness of others' knowledge and abilities. Our goal is to determine whether it is easier to communicate with several devices, each with its own specialized functions and abilities, or with a single system that can control several devices. If conversations with devices resemble conversations with people, we would expect interaction with several devices to require extra effort, both in building common ground and in specifying turn-taking. To test this, we observed participants in an office mock-up where information was accessed on displays through speech input only. Between groups, we manipulated what participants were told: in one case, that they were speaking to a single controlling system, and in the other, that they were speaking to a set of individually controlled devices. Based on language use and gaze patterns, our results suggest that the office environment was more efficient and easier to use when participants believed they were talking to a single system than when they believed they were talking to several devices.

Introduction

One approach to human-computer interaction is to improve the usability, user experience, and intuitiveness of technology by creating natural user interfaces. Here, natural refers to interactions that are like those people have with one another. Such is the goal of multimodal or attentive systems (Maglio, Matlock, Campbell, Zhai & Smith, 2000; Oviatt & Cohen, 2000) and of speech and conversational interfaces (Maybury, 1999). Understanding cues in conversation, language use, perceptual abilities, and expectations is vital to building systems that can be used with little training.

Advances in technology are resulting in smaller, cheaper, and more pervasive computational systems than ever before. But are we ready for this surge of electronics and information? No longer confined to desktop or laptop machines, computational systems will soon extend across numerous "information appliances" that are specialized for individual jobs and embedded in the everyday environment (Norman, 1998). If point-and-click graphical user interfaces (GUIs) have enabled wide use of PCs, what will be the paradigm for interaction with pervasive computing systems? As natural human-computer interfaces and pervasive systems converge, what form will technology take?

To address these questions, we explored the design of a pervasive system with speech input in an office setting. We were concerned specifically with the conversational cues that people rely on when interacting with the system. Some evidence suggests that people can attribute human-like or social qualities to computers with which they interact; for instance, networked computers described as physically close to the user are judged as more helpful than those described as physically distant (Reeves & Nass, 1996).
Although people do not treat computers as true conversational partners (Yankelovich, Levow & Marx, 1995), these sorts of results suggest that people apply natural ways of interacting to situations in which the conversational partner is a computer or other computational device.

Our main concern is whether it is easier for people to talk to a single system or to a collection of devices. In a previous study of a speech-controlled office, we found that behaviors and attitudes depended on whether users received simple command-recognition feedback (a blinking light) from the various devices that performed tasks or from a single, central location (Maglio, Matlock, Campbell, Zhai & Smith, 2000; Matlock, Campbell, Maglio, Zhai & Smith, 2001). In that study, users were faced with simple office tasks (such as looking up information, dictating a letter, and printing a letter) to be completed using speech input only. To do this, users were given a set of physical displays dedicated to various functions (such as address book, calendar, and so on). Between groups of participants, we manipulated whether feedback was associated with individual displays or with the room as a whole. This feedback manipulation was meant to suggest either central control or distributed control. Behaviorally, we found that regardless of condition, participants rarely addressed individual devices verbally, but they looked at the devices that they expected to display the results.
References

[1] Pérez-Quiñones, M. A., et al. (1996). A collaborative model of feedback in human-computer interaction. CHI.
[2] Argyle, M., et al. (1994). Gaze and Mutual Gaze. British Journal of Psychiatry.
[3] Maybury, M. T., et al. (1999). Conversational Multimedia Interaction.
[4] Kendon, A. (1967). Some functions of gaze-direction in social interaction. Acta Psychologica.
[5] Jönsson, A., et al. (1993). Wizard of Oz studies: why and how. IUI '93.
[6] Cohen, P. R., et al. (2003). Referring as a Collaborative Process.
[7] Zhai, S., et al. (2000). Gaze and Speech in Attentive User Interfaces. ICMI.
[8] Wilks, Y. (1999). Machine Conversations.
[9] Ward, N., et al. (2001). Responding to subtle, fleeting changes in the user's internal state. CHI.
[10] Norman, D. A., et al. (1998). The Invisible Computer.
[11] Levow, G.-A., et al. (1995). Designing SpeechActs: issues in speech user interfaces. CHI '95.
[12] Gould, J. D., et al. (1982). Composing letters with a simulated listening typewriter. CHI '82.
[13] Zhai, S., et al. (2001). Designing Feedback for an Attentive Office. INTERACT.
[14] Cohen, P. R., et al. (2000). Multimodal Interfaces That Process What Comes Naturally.