Level of interest sensing in spoken dialog using decision-level fusion of acoustic and lexical evidence

Automatic detection of a user's interest in spoken dialog plays an important role in many applications, such as tutoring systems and customer service systems. In this study, we propose a decision-level fusion approach that uses acoustic and lexical information to accurately sense a user's level of interest at the utterance level. Our system consists of three parts: an acoustic/prosodic model, a lexical model, and a model that combines their decisions to produce the final output. For the acoustic model, we use two different regression algorithms that complement each other. For lexical information, in addition to the bag-of-words model, we propose new features including a level-of-interest value for each word, utterance length measured in number of words, estimated speaking rate, silence within the utterance, and similarity with other utterances. We also investigate the effectiveness of using multiple automatic speech recognition (ASR) hypotheses (n-best lists) to extract lexical features. The outputs of the acoustic and lexical models are combined at the decision level. Our experiments show that combining acoustic evidence with lexical information improves level-of-interest detection performance, even when the lexical features are extracted from ASR output with a high word error rate.
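
The following is a minimal sketch of the decision-level fusion idea described above, not the paper's exact pipeline: the choice of regressors (SVR for acoustic, Ridge for lexical), the synthetic feature matrices, and the convex-combination fusion rule with a weight tuned on held-out data are all illustrative assumptions.

```python
# Sketch of decision-level fusion for level-of-interest (LOI) regression.
# Two modality-specific regressors are trained separately; their
# utterance-level predictions are then combined with a weight chosen
# on a held-out set. All model and feature choices here are assumptions.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 400
X_acoustic = rng.normal(size=(n, 20))  # stand-in for prosodic/spectral statistics
X_lexical = rng.normal(size=(n, 50))   # stand-in for bag-of-words / word-LOI features
y = np.clip(X_acoustic[:, 0] + 0.5 * X_lexical[:, 0]
            + 0.1 * rng.normal(size=n), -3, 3)  # synthetic LOI target

tr, dev = train_test_split(np.arange(n), test_size=0.25, random_state=0)

acoustic_model = SVR(C=1.0).fit(X_acoustic[tr], y[tr])
lexical_model = Ridge(alpha=1.0).fit(X_lexical[tr], y[tr])

p_ac = acoustic_model.predict(X_acoustic[dev])
p_lx = lexical_model.predict(X_lexical[dev])

# Decision-level fusion: final score is a convex combination of the two
# models' predictions, with the weight tuned on held-out data.
weights = np.linspace(0, 1, 21)
errs = [mean_absolute_error(y[dev], w * p_ac + (1 - w) * p_lx) for w in weights]
best_w = weights[int(np.argmin(errs))]
print(f"best acoustic weight {best_w:.2f}, MAE {min(errs):.3f}")
```

A practical variant would replace the synthetic matrices with real openSMILE-style acoustic features and lexical features extracted from ASR n-best lists, keeping the same train-then-fuse structure.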
