Anchor Models for Emotion Recognition from Speech

In this paper, we study the effectiveness of anchor models applied to the multiclass problem of emotion recognition from speech. In an anchor model system, an emotion class is characterized by its similarity to the other emotion classes. Generative models such as Gaussian mixture models (GMMs) are often used as front-end systems that produce feature vectors for training more complex back-end systems, such as support vector machines (SVMs) or multilayer perceptrons (MLPs), in order to improve classification performance. We show that, when the classes are highly unbalanced, these back-end systems improve on the performance of GMMs only if an appropriate sampling or importance weighting technique is applied. Furthermore, we show that anchor models based on the Euclidean or cosine distance are a better alternative for improving performance, because they need neither sampling nor importance weighting to overcome the problem of skewed data. Experiments conducted on the FAU AIBO Emotion Corpus, a database of spontaneous children's speech, show that anchor models significantly improve the performance of GMMs, by 6.2 percent relative. We also show that introducing within-class covariance normalization (WCCN) improves the performance of the anchor models for both distances, but to a greater extent for the Euclidean distance, whose results become competitive with those of the cosine distance.
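As a rough illustration of the pipeline described above, the following Python sketch maps each utterance to a vector of per-class scores, optionally applies WCCN, and assigns the class whose centroid is nearest under the cosine or Euclidean distance. It is a minimal sketch under several assumptions not stated in the abstract: the scores are taken to be per-class GMM log-likelihoods computed elsewhere, the decision rule is nearest centroid, WCCN is computed from the Cholesky factor of the inverse average within-class covariance, and all function names (`class_centroids`, `within_class_cov`, `wccn_projection`, `classify`) are hypothetical.

```python
import numpy as np


def class_centroids(X, y, n_classes):
    """Mean anchor-space vector of each class; X is (n_utterances, n_anchors)."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def within_class_cov(X, y, n_classes):
    """Average of the per-class covariance matrices (each class weighted equally)."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in range(n_classes):
        diff = X[y == c] - X[y == c].mean(axis=0)
        W += diff.T @ diff / len(diff)
    return W / n_classes


def wccn_projection(X, y, n_classes, ridge=1e-8):
    """Return B with B @ B.T = inv(W); normalized vectors are X @ B.

    The small ridge term is an assumption added here for numerical stability.
    """
    W = within_class_cov(X, y, n_classes)
    W_inv = np.linalg.inv(W + ridge * np.eye(W.shape[0]))
    return np.linalg.cholesky(W_inv)


def classify(X, centroids, metric="cosine"):
    """Assign each row of X to the nearest class centroid."""
    if metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        # Highest cosine similarity corresponds to smallest cosine distance.
        return np.argmax(Xn @ Cn.T, axis=1)
    if metric == "euclidean":
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return np.argmin(dists, axis=1)
    raise ValueError(f"unknown metric: {metric}")
```

In use, `X` would hold one row per utterance with one column per anchor (emotion) model, and the projection `B` would be estimated on training data only and then applied to both training and test vectors before computing centroids and distances. Because the decision depends only on distances to class centroids, this sketch involves no resampling or class weighting step.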
