Multi-Stream Attention-Based BLSTM with Feature Segmentation for Speech Emotion Recognition

This paper proposes a speech emotion recognition technique that considers the suprasegmental characteristics and temporal changes of individual speech parameters. In recent years, speech emotion recognition using bidirectional LSTMs (BLSTMs) has been studied actively because such models can focus on the particular temporal regions that carry strong emotional characteristics. One weakness of this approach is that it cannot consider the statistics of speech features, which are known to be effective for speech emotion recognition. Moreover, it cannot train separate attention parameters for different descriptors because a single BLSTM handles the entire input sequence. In this paper, we introduce feature segmentation and multi-stream processing into an attention-based BLSTM to address these problems. In addition, we employ data augmentation based on emotional speech synthesis in the training step. Classification experiments among four emotions (anger, joy, neutral, and sadness) on the Japanese Twitter-based Emotional Speech corpus (JTES) show that the proposed method achieves a recognition accuracy of 73.4%, which is comparable to human evaluation (75.5%).
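The following is a minimal sketch of the kind of architecture the abstract describes: the input features are segmented into groups (streams), each stream is processed by its own BLSTM with its own attention-based pooling, and the pooled stream representations are concatenated for the four-class emotion decision. The specific feature grouping, layer sizes, and the two-stream prosodic/spectral split shown here are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveBLSTMStream(nn.Module):
    """One stream: a BLSTM over a single feature group, pooled with additive attention."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)                     # (batch, frames, 2 * hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)   # per-frame attention weights for this stream
        return (w * h).sum(dim=1)                # attention-weighted pooling -> (batch, 2 * hidden_dim)

class MultiStreamSER(nn.Module):
    """Feature segmentation + multi-stream processing: each feature group gets its own attentive BLSTM."""
    def __init__(self, stream_dims, hidden_dim=128, n_emotions=4):
        super().__init__()
        self.streams = nn.ModuleList(
            [AttentiveBLSTMStream(d, hidden_dim) for d in stream_dims])
        self.classifier = nn.Linear(2 * hidden_dim * len(stream_dims), n_emotions)

    def forward(self, xs):                       # xs: list of per-stream frame sequences
        pooled = [stream(x) for stream, x in zip(self.streams, xs)]
        return self.classifier(torch.cat(pooled, dim=-1))

# Hypothetical split into a 2-dim prosodic stream (e.g., F0 and power) and a 39-dim spectral stream (e.g., MFCCs)
model = MultiStreamSER(stream_dims=[2, 39])
prosodic = torch.randn(8, 200, 2)
spectral = torch.randn(8, 200, 39)
logits = model([prosodic, spectral])             # (8, 4) scores for anger / joy / neutral / sadness
```

Because each stream has its own attention layer, the model can attend to different temporal regions for different descriptors, which a single shared BLSTM cannot do.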
