LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition

Speech emotion recognition is a vital contributor to the next generation of human-computer interaction (HCI). However, existing small-scale databases have limited the development of related research. In this paper, we present LSSED, a challenging large-scale English speech emotion dataset with data collected from 820 subjects to simulate the real-world distribution. In addition, we release pre-trained models based on LSSED, which not only promote the development of speech emotion recognition, but can also be transferred to related downstream tasks such as mental health analysis, where data are extremely difficult to collect. Finally, our experiments show the necessity of large-scale datasets and the effectiveness of pre-trained models. The dataset will be released at https://github.com/tobefans/LSSED.
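The abstract mentions transferring the released pre-trained models to downstream tasks where labeled data is scarce. Below is a minimal sketch of what such a transfer setup could look like in PyTorch; it is not the authors' released code, and the checkpoint path, encoder class, feature dimension, and class count are hypothetical placeholders.

```python
# Hypothetical sketch: reusing a speech emotion encoder pre-trained on LSSED
# for a downstream task with limited labels (e.g., mental health analysis).
import torch
import torch.nn as nn


class DownstreamClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 512, num_classes: int = 4):
        super().__init__()
        self.encoder = encoder                         # encoder pre-trained on LSSED (assumed)
        self.head = nn.Linear(feat_dim, num_classes)   # small task-specific head trained from scratch

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(spectrogram)              # (batch, feat_dim) utterance embedding
        return self.head(feats)                        # (batch, num_classes) logits


# encoder = torch.load("lssed_pretrained_encoder.pt")  # hypothetical checkpoint name
# model = DownstreamClassifier(encoder)
# When target data is extremely scarce, one common choice is to freeze the
# encoder and train only the classification head:
# for p in model.encoder.parameters():
#     p.requires_grad = False
```

Whether to freeze or fully fine-tune the encoder is a design choice that depends on how much downstream data is available; the sketch above only illustrates the general pattern.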
