Pitch-synchronous single frequency filtering spectrogram for speech emotion recognition

Convolutional neural networks (CNNs) are widely used for speech emotion recognition (SER). In such systems, the short-time Fourier transform (STFT) spectrogram is the most popular representation of speech fed as input to the CNN. However, the time-frequency uncertainty principle prevents the STFT from achieving good time and frequency resolution simultaneously. The recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it offers good time and frequency resolution simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We modify the SFF spectrogram by averaging the amplitudes of all samples between two successive glottal closure instant (GCI) locations. Since the duration between two successive GCIs is the pitch period, we call the modified representation the pitch-synchronous SFF spectrogram. The GCI locations were detected using the zero frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracies of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset, an improvement of +7.35% (unweighted) and +4.3% (weighted) over the state-of-the-art result obtained with a CNN on the STFT spectrogram. Notably, the proposed method recognized 22.7% of the happy-emotion samples correctly, whereas this number was 0% for the state-of-the-art result. These results also suggest much wider applicability of the proposed pitch-synchronous SFF spectrogram to other speech-based applications.
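
As a rough illustration of the two steps described above, the sketch below (NumPy/SciPy) first computes SFF amplitude envelopes and then averages them between successive GCI locations. It assumes the standard SFF formulation, in which each analysis frequency is shifted to fs/2 and filtered by a single pole placed near the unit circle (radius r = 0.995); the function names and parameter choices are illustrative assumptions, not the authors' exact implementation, and GCI detection itself is not shown.

    import numpy as np
    from scipy.signal import lfilter

    def sff_spectrogram(x, fs, freqs, r=0.995):
        """Single frequency filtering (SFF) amplitude envelopes.

        Each analysis frequency f_k is shifted to fs/2 by a complex
        exponential, and the shifted signal is passed through the
        single-pole filter H(z) = 1 / (1 + r z^{-1}), whose pole lies
        close to the unit circle. The magnitude of the filter output is
        the sample-level amplitude envelope of the signal at f_k.
        """
        n = np.arange(len(x))
        env = np.empty((len(freqs), len(x)))
        for k, f_k in enumerate(freqs):
            w = np.pi - 2.0 * np.pi * f_k / fs        # shift f_k to fs/2
            shifted = x * np.exp(-1j * w * n)
            y = lfilter([1.0], [1.0, r], shifted)     # y[n] = -r*y[n-1] + shifted[n]
            env[k] = np.abs(y)                        # amplitude envelope at f_k
        return env

    def pitch_synchronous_sff(env, gci):
        """Average SFF amplitudes over each pitch period.

        `gci` holds the glottal closure instants as sorted sample indices
        (e.g. from a zero frequency filtering based epoch extractor). All
        samples between two successive GCIs are averaged, yielding one
        spectral vector per pitch period.
        """
        frames = [env[:, a:b].mean(axis=1)
                  for a, b in zip(gci[:-1], gci[1:]) if b > a]
        return np.stack(frames, axis=1)               # (num_freqs, num_pitch_periods)

Averaging within each pitch period replaces the fixed frame rate of a conventional spectrogram with a rate tied to the speaker's pitch, which is why the resulting representation is called pitch-synchronous.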
