Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

End-to-end (E2E) neural network models have shown significant performance benefits on several INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model to a task or the same E2E architecture across different tasks. However, a single model can be unstable, while reusing one architecture across tasks under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate ensembles of E2E models for robust performance and develop task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge, predicting the output of a respiratory belt worn by a patient while speaking; the elderly sub-challenge, estimating an elderly speaker's arousal and valence levels; and the mask sub-challenge, classifying whether the speaker is wearing a mask. On each of these tasks, an ensemble outperforms a single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting both valence and arousal levels prompts us to investigate multi-task training and to implement data sampling strategies that handle class imbalance. On the mask sub-challenge, an E2E system without feature engineering is competitive with feature-engineered baselines and provides substantial gains when combined with them.
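
The abstract names three concrete techniques: ensembling independently trained E2E models, multi-task training with a combined (multi-loss) objective, and data sampling to counter class imbalance. The PyTorch sketch below illustrates all three for the elderly sub-challenge under stated assumptions; it is not the authors' implementation. The raw-waveform encoder layout, the three-class label space, the loss weight alpha, and the names MultiTaskE2E, multi_task_loss, balanced_loader, and ensemble_predict are all illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

class MultiTaskE2E(nn.Module):
    """Shared raw-waveform encoder with separate arousal/valence heads
    (assumed architecture, for illustration only)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(            # shared front end
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.arousal_head = nn.Linear(128, n_classes)  # task-specific heads
        self.valence_head = nn.Linear(128, n_classes)

    def forward(self, wav):                      # wav: (batch, 1, samples)
        z = self.encoder(wav)
        return self.arousal_head(z), self.valence_head(z)

def multi_task_loss(a_logits, v_logits, a_y, v_y, alpha=0.5):
    """One multi-loss variant: a weighted sum of the per-task losses."""
    ce = nn.CrossEntropyLoss()
    return alpha * ce(a_logits, a_y) + (1.0 - alpha) * ce(v_logits, v_y)

def balanced_loader(dataset, labels, batch_size=16):
    """Oversample rare classes so batches are roughly class-balanced."""
    labels = torch.as_tensor(labels)
    weights = 1.0 / torch.bincount(labels)[labels].float()
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

@torch.no_grad()
def ensemble_predict(models, wav):
    """Average the softmax outputs of independently trained models
    (here over the arousal head only)."""
    probs = [torch.softmax(m(wav)[0], dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

Here alpha trades off the two task losses; weighting (or learning) it per task is one form the multi-loss strategies above could take, and the weighted sampler is one common way to realize the class-imbalance handling mentioned for the elderly sub-challenge.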
