Predicting hyperarticulate speech during human-computer error resolution

Abstract

When speaking to interactive systems, people sometimes hyperarticulate, adopting a clarified form of speech that has been associated with increased recognition errors. The goals of the present study were (1) to establish a flexible simulation method for studying users' reactions to system errors, (2) to analyze the type and magnitude of linguistic adaptations in speech during human-computer error resolution, (3) to provide a unified theoretical model for interpreting and predicting users' spoken adaptations during system error handling, and (4) to outline the implications for developing more robust interactive systems. A semi-automatic simulation method with a novel error-generation capability was developed to compare users' speech immediately before and after system recognition errors, and under conditions varying in error base rate. Matched original-repeat utterance pairs were then analyzed for the type and magnitude of linguistic adaptation. The analyses revealed that, when resolving errors with a computer, users actively tailor their speech along a spectrum of hyperarticulation, as a predictable reaction to their perception of the computer as an “at risk” listener. Under both low and high error rates, durational changes were pervasive, including elongation of the speech segment and large relative increases in the number and duration of pauses. Under a high error rate, speech was also adapted to include more hyper-clear phonological features, fewer disfluencies, and changes in fundamental frequency. The two-stage CHAM model (Computer-elicited Hyperarticulate Adaptation Model) is proposed to account for these changes in users' speech during interactive error resolution.
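To make the reported measures concrete, below is a minimal sketch of how matched original-repeat pairs could be scored along the dimensions the abstract names (segment duration, number and duration of pauses, fundamental frequency), together with a base-rate error trigger standing in for the simulation's error-generation step. The Utterance record, its field names, and inject_error are hypothetical illustrations, not the study's actual instrumentation.

```python
"""Illustrative sketch only: hypothetical structures for scoring
matched original-repeat utterance pairs; not the study's tooling."""
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    duration_s: float      # total duration of the speech segment
    pauses_s: List[float]  # duration of each silent pause
    mean_f0_hz: float      # mean fundamental frequency


def relative_change(before: float, after: float) -> float:
    """Signed relative change from the original to the repeat value."""
    return (after - before) / before if before else 0.0


def adaptation_profile(original: Utterance, repeat: Utterance) -> dict:
    """Score a repeat utterance against its matched original along the
    dimensions reported in the abstract: segment elongation, pause
    count and total pause time, and fundamental frequency shift."""
    return {
        "duration_change": relative_change(original.duration_s, repeat.duration_s),
        "pause_count_change": len(repeat.pauses_s) - len(original.pauses_s),
        "pause_time_change": relative_change(sum(original.pauses_s), sum(repeat.pauses_s)),
        "f0_shift_hz": repeat.mean_f0_hz - original.mean_f0_hz,
    }


def inject_error(recognized: str, base_rate: float, rng: random.Random) -> str:
    """Stand-in for the simulation's error generation: with probability
    base_rate, replace the recognized string with a simulated
    misrecognition, prompting the user to repeat the utterance."""
    return "<misrecognition>" if rng.random() < base_rate else recognized


if __name__ == "__main__":
    # Hypothetical pair: the repeat is elongated, more paused, higher in F0.
    orig = Utterance(duration_s=1.8, pauses_s=[0.12], mean_f0_hz=190.0)
    rep = Utterance(duration_s=2.4, pauses_s=[0.25, 0.30], mean_f0_hz=196.0)
    print(adaptation_profile(orig, rep))
```

Under this framing, the low and high error-rate conditions differ only in the probability passed to the trigger, while the pair-scoring step stays the same; the abstract's finding is that the durational scores shift at both rates, whereas phonological and fundamental-frequency adaptations emerge only under the high rate.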
