Confirmation detection in human-agent interaction using non-lexical speech cues

Even when only the acoustic channel is considered, human communication is highly multi-modal. Non-lexical cues convey a variety of information, such as emotion or agreement. The ability to process such cues is highly relevant for spoken dialog systems, especially assistance systems. In this paper we focus on the recognition of non-lexical confirmations such as "mhm", as they enhance a system's ability to accurately interpret human intent in natural communication. The architecture uses a Support Vector Machine to detect confirmations based on acoustic features. In a systematic comparison, several feature sets were evaluated for their performance on a corpus of human-agent interaction recorded with naive users, including elderly and cognitively impaired people. Our results show that stacked formant features yield an accuracy of 84% for online classification, outperforming regular formants as well as MFCC- and pitch-based features.
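The pipeline described above can be sketched in outline: per-frame formant estimates are stacked over a short temporal context into fixed-length feature vectors, which then feed a Support Vector Machine. The sketch below illustrates only the feature-stacking and classification steps; it uses synthetic formant tracks as a stand-in, since the paper's corpus, its actual formant extractor, and its segment-level pooling are not specified here. The `stack_formants` helper, the context width of 5 frames, and the mean-pooling of stacked frames into one vector per segment are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def stack_formants(formants, context=5):
    """Stack `context` consecutive frames of (F1, F2, F3) estimates
    into one feature vector per position (assumed pooling scheme)."""
    n = formants.shape[0] - context + 1
    return np.stack([formants[i:i + context].ravel() for i in range(n)])

def synth_segment(confirm):
    """Synthetic stand-in for a formant track of one speech segment:
    'mhm'-like confirmations get low, stable formants; other speech
    gets higher, more variable ones. 20 frames x 3 formants (Hz)."""
    base = np.array([300.0, 1200.0, 2500.0]) if confirm \
        else np.array([600.0, 1700.0, 2900.0])
    spread = 50.0 if confirm else 200.0
    return base + rng.normal(0.0, spread, size=(20, 3))

# Build a small synthetic dataset: one mean stacked-formant vector per segment.
X, y = [], []
for label in (0, 1):
    for _ in range(40):
        stacked = stack_formants(synth_segment(bool(label)))
        X.append(stacked.mean(axis=0))
        y.append(label)
X, y = np.asarray(X), np.asarray(y)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# SVM classifier on standardized stacked-formant features.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

On this easily separable toy data the classifier scores near 100%; the 84% figure reported in the abstract refers to the real corpus, not to this sketch.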
