Exploitation of Phase-Based Features for Whispered Speech Emotion Recognition

Features for speech emotion recognition are typically dominated by spectral magnitude information, while the phase spectrum is ignored because it is difficult to interpret. Motivated by recent successes of phase-based features in speech processing, this paper investigates the effectiveness of phase information for whispered speech emotion recognition. We select two types of phase-based features (i.e., modified group delay features and all-pole group delay features), both of which have shown wide applicability across speech analysis tasks and are studied here for whispered speech emotion recognition. To exploit these features, we propose a new speech emotion recognition framework that employs the outer product in combination with power and L2 normalization. This technique encodes any variable-length sequence of phase-based features into a vector of fixed dimension, regardless of the length of the input sequence. The resulting representation is used to train a classifier with a linear kernel. Experimental results on the Geneva Whispered Emotion Corpus, which includes normal and whispered phonation, demonstrate the effectiveness of the proposed method compared with other modern systems. We also show that combining phase information with magnitude information significantly improves performance over common systems that rely solely on magnitude information.
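The outer-product encoding described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the averaging of frame-wise outer products and the power-normalization exponent of 0.5 are assumptions, chosen because they are common choices for this family of encodings.

```python
import numpy as np

def outer_product_encode(frames, alpha=0.5):
    """Encode a variable-length (T, D) feature sequence into a fixed
    D*D-dimensional vector via outer-product pooling, followed by
    signed power normalization and L2 normalization."""
    X = np.asarray(frames, dtype=float)      # (T, D) phase-based features
    pooled = X.T @ X / X.shape[0]            # (D, D) mean outer product
    v = pooled.ravel()                       # flatten to length D*D
    v = np.sign(v) * np.abs(v) ** alpha      # signed power normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v       # L2 normalization

# Utterances of different lengths map to vectors of the same dimension,
# suitable as input to a linear-kernel classifier.
rng = np.random.default_rng(0)
short = outer_product_encode(rng.standard_normal((50, 13)))
long_ = outer_product_encode(rng.standard_normal((400, 13)))
```

Because the pooled representation has a fixed size of D*D for D-dimensional frames, the downstream linear classifier never sees the utterance length.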
