Multi-scale modulation filtering in automatic detection of emotions in telephone speech

This study investigates emotion detection from noise-corrupted telephone speech. A generic modulation filtering approach for audio pattern recognition is proposed that exploits the inherent long-term properties of acoustic features in different classes. When applied to binary classification along the activation and valence dimensions, filtering the baseline short-time timbral features in both the training and detection phases yields significant improvements, especially in noise robustness. Automatic selection of training data based on the filter's prediction residual further improves the results.
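The core idea of modulation filtering is to filter the temporal trajectory of each short-time feature dimension, retaining the slow modulation content characteristic of a target class. The sketch below is an illustrative minimal version using a fixed low-pass FIR kernel; the `modulation_filter` function, kernel choice, and feature dimensions are assumptions for demonstration, not the paper's actual (class-specific, data-driven) filters.

```python
import numpy as np

def modulation_filter(features, kernel):
    """Filter each feature dimension's temporal trajectory.

    features : (n_frames, n_dims) matrix of short-time features
    kernel   : 1-D FIR impulse response applied along the time axis
    """
    return np.apply_along_axis(
        lambda traj: np.convolve(traj, kernel, mode="same"),
        axis=0, arr=features)

# Hypothetical low-pass kernel emphasizing slow modulations (a few Hz
# at a typical 100 frames/s rate); the paper derives its filters from
# training data rather than fixing them by hand.
kernel = np.ones(11) / 11.0

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 13))   # e.g., 200 frames of 13 timbral features
filtered = modulation_filter(feats, kernel)
```

In such a pipeline the same filtering would be applied to the features of both the training and the test (detection) material before classification, which is what makes the representation consistently emphasize the long-term class-dependent dynamics.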
