Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

End-to-end (E2E) neural network models have shown significant performance benefits on several INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model to a task or the same E2E architecture across different tasks. However, a single model can be unstable, while reusing one architecture across tasks under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate ensembles of E2E models for robust performance and develop task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge, predicting the output of a respiratory belt worn by a patient while speaking; the elderly sub-challenge, estimating an elderly speaker's arousal and valence levels; and the mask sub-challenge, classifying whether the speaker is wearing a mask. On each of these tasks, an ensemble outperforms a single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting both valence and arousal levels prompts us to investigate multi-task training and to implement data sampling strategies that handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with feature-engineered baselines and provides substantial gains when combined with them.
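
The abstract names three concrete techniques: ensembling independently trained E2E models, multi-task training with a combined (multi-loss) objective, and data sampling to counter class imbalance. The PyTorch sketch below illustrates all three for the elderly sub-challenge under stated assumptions; it is not the authors' implementation. The raw-waveform encoder layout, the three-class label space, the loss weight alpha, and the names MultiTaskE2E, multi_task_loss, balanced_loader, and ensemble_predict are all illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

class MultiTaskE2E(nn.Module):
    """Shared raw-waveform encoder with separate arousal/valence heads
    (assumed architecture, for illustration only)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(            # shared front end
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.arousal_head = nn.Linear(128, n_classes)  # task-specific heads
        self.valence_head = nn.Linear(128, n_classes)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        z = self.encoder(wav)
        return self.arousal_head(z), self.valence_head(z)

def multi_task_loss(a_logits, v_logits, a_y, v_y, alpha=0.5):
    """One multi-loss variant: a weighted sum of the per-task losses."""
    ce = nn.CrossEntropyLoss()
    return alpha * ce(a_logits, a_y) + (1.0 - alpha) * ce(v_logits, v_y)

def balanced_loader(dataset, labels, batch_size=16):
    """Oversample rare classes so batches are roughly class-balanced."""
    labels = torch.as_tensor(labels)
    weights = 1.0 / torch.bincount(labels)[labels].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

@torch.no_grad()
def ensemble_predict(models, wav):
    """Average the softmax outputs of independently trained models
    (here over the arousal head only)."""
    probs = [torch.softmax(m(wav)[0], dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

Here alpha trades off the two task losses; weighting (or learning) it per task is one form the multi-loss strategies above could take, and the weighted sampler is one common way to realize the class-imbalance handling mentioned for the elderly sub-challenge.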
