Automatically detected acoustic landmarks for assessing natural emotion from speech

This Master's thesis focuses on emotion recognition from speech, a field whose popularity has grown considerably over the past decade, with more than 100 papers published per year since 2004. Sophisticated methods now exist for extracting speech features that allow emotions to be analysed. Although these methods give convincing results on acted emotions, acted emotions are not truly natural, so the techniques do not transfer to real-world conditions. Current techniques mainly rely on acoustic features, e.g., pitch, energy, and voice quality, and perform poorly when applied to spontaneous, natural recordings of speech. The present challenge for the field therefore lies in finding techniques that allow robust emotion recognition from spontaneous speech. A possible solution is to exploit linguistic features for the emotion recognition task. However, to enable future non-human interaction partners, linguistic features must be extracted automatically, as is already the case for acoustic features. Until now, only a few studies have addressed emotion recognition through linguistic features. Some used the output of automatic speech recognition (ASR) systems to extract features rather than manual annotation of the data, and experiments based on ASR output yielded lower performance than those based on manual annotation. This is because ASR is prone to transcription errors, since emotion introduces significant variability into the speech signal. This thesis investigates new techniques for extracting linguistic features from speech with the aim of recognising emotion without an ASR system. Instead of word transcriptions from ASR, our technique exploits acoustic landmarks, i.e., events correlated with changes in speech production and perception that are automatically detectable in the speech signal. We compared the emotion recognition performance of manually transcribed words and automatically detected acoustic landmarks on the same corpus (SEMAINE), using state-of-the-art feature extraction methods such as bag-of-words and n-grams. Results show that even simple landmarks, such as voiced/unvoiced segments of speech, already yield better performance than manually transcribed words. Vowel/consonant landmarks and p-centers (rhythmic events) used alone bring no improvement, whereas the best scores are obtained by fusing voiced/unvoiced landmarks with p-centers. Although the difference between the scores obtained on words and those obtained from the fusion of voiced/unvoiced landmarks and p-centers is not statistically significant, this work demonstrates the value of working with acoustic landmarks: by extracting linguistic features automatically from the acoustic signal, we remove the need for an ASR system while improving the scores.
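To make the feature-extraction step concrete, the sketch below shows one way a sequence of automatically detected landmarks could be turned into bag-of-n-gram features and fed to a classifier. This is a minimal illustration, not the thesis implementation: the landmark labels ('V' voiced, 'U' unvoiced, 'P' p-center), the toy utterances and emotion labels, and the scikit-learn pipeline are all assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the thesis code): landmarks as a symbol
# sequence, bag-of-n-gram features, and a linear SVM classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Each utterance is reduced to the chronological sequence of its detected
# landmarks, e.g. voiced ('V'), unvoiced ('U') and p-center ('P') events.
utterances = [
    "V P V U V P V U",   # hypothetical utterance 1
    "U V U V P U V U",   # hypothetical utterance 2
    "V V P U V P V V",   # hypothetical utterance 3
]
labels = ["high_arousal", "low_arousal", "high_arousal"]  # toy emotion labels

# Bag of n-grams (unigrams to trigrams) over the landmark symbols, mirroring
# the bag-of-words / n-gram setup that would otherwise be applied to words.
model = make_pipeline(
    CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 3), lowercase=False),
    SVC(kernel="linear"),
)
model.fit(utterances, labels)
print(model.predict(["V P V U V P V V"]))
```

In practice the same vectorizer could be applied to word transcriptions and to landmark sequences, which keeps the comparison between the two representations fair: only the input symbols change, not the feature extraction or the classifier.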
