Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition

Acoustic and linguistic analysis for elderly emotion recognition is an under-studied and challenging research direction, but essential for the creation of digital assistants for the elderly, as well as unobtrusive telemonitoring of elderly in their residences for mental healthcare purposes. This paper presents our contribution to the INTERSPEECH 2020 Computational Paralinguistics Challenge (ComParE) - Elderly Emotion Sub-Challenge, which is comprised of two ternary classification tasks for arousal and valence recognition. We propose a bi-modal framework, where these tasks are modeled using state-of-the-art acoustic and linguistic features, respectively. In this study, we demonstrate that exploiting task-specific dictionaries and resources can boost the performance of linguistic models, when the amount of labeled data is small. Observing a high mismatch between development and test set performances of various models, we also propose alternative training and decision fusion strategies to better estimate and improve the generalization performance.

[1]  Li-Ping Jing,et al.  Improved feature selection approach TFIDF in text mining , 2002, Proceedings. International Conference on Machine Learning and Cybernetics.

[2]  Roland Vollgraf,et al.  FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP , 2019, NAACL.

[3]  Yiqiang Chen,et al.  Weighted extreme learning machine for imbalance learning , 2013, Neurocomputing.

[4]  Björn W. Schuller,et al.  Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks , 2018 .

[5]  Bruno Trstenjak,et al.  on Intelligent Manufacturing and Automation , 2013 KNN with TF-IDF Based Framework for Text Categorization , 2014 .

[6]  Heysem Kaya,et al.  Fusing Acoustic Feature Representations for Computational Paralinguistics Tasks , 2016, INTERSPEECH.

[7]  Prakhar Gupta,et al.  Learning Word Vectors for 157 Languages , 2018, LREC.

[8]  Albert Ali Salah,et al.  Assessing Affective Dimensions of Play in Psychodynamic Child Psychotherapy via Text Analysis , 2016, HBU.

[9]  Björn W. Schuller,et al.  The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks , 2020, INTERSPEECH.

[10]  Fakhri Karray,et al.  Survey on speech emotion recognition: Features, classification schemes, and databases , 2011, Pattern Recognit..

[11]  Wiebke Wagner,et al.  Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit , 2010, Lang. Resour. Evaluation.

[12]  Albert Ali Salah,et al.  Predicting Depression and Emotions in the Cross-roads of Cultures, Para-linguistics, and Non-linguistics , 2019, AVEC@MM.

[13]  Björn W. Schuller,et al.  Snore Sound Classification Using Image-Based Deep Spectrum Features , 2017, INTERSPEECH.

[14]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[15]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[16]  Andrea Esuli,et al.  SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining , 2010, LREC.

[17]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[18]  Wolfgang Minker,et al.  Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges , 2020, INTERSPEECH.

[19]  Gerhard Heyer,et al.  SentiWS - A Publicly Available German-language Resource for Sentiment Analysis , 2010, LREC.

[20]  Björn W. Schuller,et al.  Voice and Speech Analysis in Search of States and Traits , 2011, Computer Analysis of Human Behavior.

[21]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Hongming Zhou,et al.  Extreme Learning Machine for Regression and Multiclass Classification , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[23]  Eric Gilbert,et al.  VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text , 2014, ICWSM.

[24]  Djoerd Hiemstra,et al.  A probabilistic justification for using tf×idf term weighting in information retrieval , 2000, International Journal on Digital Libraries.

[25]  Björn W. Schuller,et al.  openXBOW - Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit , 2016, J. Mach. Learn. Res..

[26]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[27]  Albert Ali Salah,et al.  Fisher vectors with cascaded normalization for paralinguistic analysis , 2015, INTERSPEECH.

[28]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[29]  Albert Ali Salah,et al.  Emotion, age, and gender classification in children's speech by humans and machines , 2017, Comput. Speech Lang..

[30]  Heysem Kaya,et al.  Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold , 2017, INTERSPEECH.