An Investigation of Partition-Based and Phonetically-Aware Acoustic Features for Continuous Emotion Prediction from Speech

Phonetic variability has long been considered a confounding factor in emotional speech processing, so phonetic features have rarely been explored. Surprisingly, however, some features carrying purely phonetic information have shown state-of-the-art performance for continuous prediction of emotions (e.g., arousal and valence), and the underlying causes remain unknown to date. In this article, we present in-depth investigations of phonetic features on three widely used corpora - RECOLA, SEMAINE and USC CreativeIT - from two perspectives: acoustic space partitioning information and phonetic content. First, comparisons of multiple partitioning methods confirm the significance of partitioning information in speech and reveal a new insight: varying the number of partitions has a greater effect on valence than on arousal prediction, so a detailed representation of the acoustic space is needed for valence, whereas a coarse one is adequate for arousal. Second, phoneme-specific examination of phonetic features suggests that phonetic content is less emotionally informative than partitioning information, and is more important for arousal than for valence. Furthermore, we propose a novel set of phonetically-aware acoustic features that attains significant improvements in valence (in particular) and arousal prediction on RECOLA, SEMAINE and CreativeIT, compared with conventional reference acoustic features.
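To make the partitioning idea concrete, the sketch below shows one plausible way to partition an acoustic feature space and vary the number of partitions: k-means clustering of pooled frame-level features, followed by per-utterance cluster-occupancy histograms in the spirit of bag-of-audio-words representations. This is an assumed illustration only; the clustering method, the MFCC-like inputs, and the function names (fit_partitions, occupancy_histogram) are hypothetical and are not taken from the article.

```python
# Illustrative sketch (assumptions noted above): partition an acoustic feature
# space with k-means and build per-utterance cluster-occupancy histograms.
import numpy as np
from sklearn.cluster import KMeans

def fit_partitions(frame_features: np.ndarray, n_partitions: int) -> KMeans:
    """Learn a partition of the acoustic space from pooled frame-level features.

    frame_features: (n_frames, n_dims) array pooled over the training set.
    n_partitions: number of regions; per the abstract, a finer partition
    (larger value) matters more for valence than for arousal.
    """
    return KMeans(n_clusters=n_partitions, n_init=10, random_state=0).fit(frame_features)

def occupancy_histogram(km: KMeans, utterance_frames: np.ndarray) -> np.ndarray:
    """Map one utterance to a normalized histogram over the learned partitions."""
    labels = km.predict(utterance_frames)
    counts = np.bincount(labels, minlength=km.n_clusters).astype(float)
    return counts / max(counts.sum(), 1.0)

# Example usage with synthetic data standing in for frame-level features.
rng = np.random.default_rng(0)
train_frames = rng.normal(size=(5000, 13))   # pooled training frames
utt_frames = rng.normal(size=(300, 13))      # frames of a single utterance

for k in (8, 64):                            # coarse vs. detailed partitioning
    km = fit_partitions(train_frames, n_partitions=k)
    hist = occupancy_histogram(km, utt_frames)
    print(k, hist.shape)
```

Sweeping the number of partitions (here 8 vs. 64) mirrors the coarse-versus-detailed comparison described in the abstract, where arousal prediction tolerates a general representation but valence benefits from a finer-grained one.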
