Over-sampling Emotional Speech Data Based on Subjective Evaluations Provided by Multiple Individuals

A common step in speech emotion recognition is to obtain ground-truth labels describing the emotional content of a sentence. The underlying emotion of a given recording is usually unknown, so perceptual evaluations are conducted to annotate its perceived emotion. Each sentence is often annotated by multiple raters, whose evaluations are then aggregated with methods such as majority vote. This paper argues that the individual labels provided by different raters convey more information than the consensus label alone. We demonstrate that leveraging the separate evaluations collected from multiple raters helps build more robust classifiers that make fuller use of the labeled data. Motivated by the synthetic minority over-sampling technique (SMOTE), we present a novel over-sampling approach during training, where samples with categorical emotion labels are over-sampled according to the labels assigned by multiple individuals. This approach (1) increases the number of sentences from classes with underrepresented consensus labels, and (2) utilizes sentences with ambiguous emotional content even when they do not reach consensus agreement. The experimental evaluation shows the benefits of the approach over a baseline classifier trained with consensus labels: it increases the F1-score by 5.2% (absolute) on the USC-IEMOCAP corpus and by 5.4% (absolute) on the MSP-IMPROV corpus.
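
To make the core idea concrete, the Python sketch below replicates each training sample once per individual rater label instead of collapsing the votes into a single consensus label. This is a minimal illustration of the annotation-driven over-sampling idea, not the paper's exact algorithm (which is motivated by SMOTE and may synthesize new feature vectors rather than duplicate existing ones); the function name `oversample_by_annotations` and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def oversample_by_annotations(features, annotations):
    """Replicate each sample once per individual rater label.

    features    : (N, D) array of acoustic feature vectors
    annotations : list of N lists holding the categorical labels
                  assigned by each rater to the corresponding sample

    Sentences without a majority label still contribute training
    examples, and classes that are rare under consensus labeling gain
    extra copies whenever any rater perceived that emotion.
    """
    X_out, y_out = [], []
    for x, labels in zip(features, annotations):
        for label in labels:          # one training copy per vote
            X_out.append(x)
            y_out.append(label)
    return np.asarray(X_out), np.asarray(y_out)

# Toy example: three sentences, each rated by three annotators.
X = np.random.randn(3, 4)                      # hypothetical features
votes = [["happy", "happy", "neutral"],        # clear majority
         ["angry", "sad", "neutral"],          # no consensus: still used
         ["sad", "sad", "sad"]]                # unanimous
X_os, y_os = oversample_by_annotations(X, votes)
print(Counter(y_os))  # sad: 4, happy: 2, neutral: 2, angry: 1
```

Note how the second sentence, which a majority-vote rule would discard for lack of agreement, contributes one example to each of three classes, and how the minority "angry" class gains a sample it would otherwise never see.
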
