Self-Assessed Affect Recognition Using Fusion of Attentional BLSTM and Static Acoustic Features

In this study, we present a computational framework for the Self-Assessed Affect Sub-Challenge of the INTERSPEECH 2018 Computational Paralinguistics Challenge. The goal of this sub-challenge is to classify the valence scores reported by the speakers themselves into three levels: low, medium, and high. We explore fusing bi-directional LSTM (BLSTM) models with the baseline SVM models to improve recognition accuracy. Specifically, we extract frame-level acoustic low-level descriptors (LLDs) as input to a BLSTM with a modified attention mechanism, and train separate SVMs on the standard ComParE 2016 baseline feature set with minority-class upsampling. The predictions of these diverse models are then combined with a decision-level score fusion scheme. Our proposed approach achieves 62.94% and 67.04% unweighted average recall (UAR), absolute improvements of 6.24% and 1.04% over the best baselines provided by the challenge organizers. We further provide a detailed comparative analysis of the different models.
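The two core components of the pipeline can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: `attention_pool` shows a generic soft-attention pooling over BLSTM frame outputs (the paper's "modified" attention is not specified here, so a plain dot-product attention with a hypothetical weight vector `w` is used), and `fuse_scores` shows decision-level score fusion as a weighted average of per-class posteriors from the BLSTM and SVM models (the actual fusion weights are an assumption).

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Soft-attention pooling over frame-level features.
    frame_feats: (n_frames, dim) BLSTM outputs; w: (dim,) attention weights.
    Returns a single (dim,) utterance-level vector."""
    logits = frame_feats @ w                      # per-frame attention logits
    alpha = np.exp(logits - logits.max())         # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha @ frame_feats                    # attention-weighted sum

def fuse_scores(score_lists, weights=None):
    """Decision-level score fusion: weighted average of per-class
    posteriors from several models, then argmax per sample."""
    scores = np.stack(score_lists)                # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(score_lists))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, scores, axes=1) # (n_samples, n_classes)
    return fused.argmax(axis=1)

# Toy posteriors for 2 utterances over the 3 classes (low / medium / high)
blstm = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
svm   = np.array([[0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
preds = fuse_scores([blstm, svm], weights=[0.5, 0.5])  # -> array([0, 2])
```

With equal weights the fused posteriors for the first utterance are `[0.45, 0.4, 0.15]`, so fusion keeps the BLSTM's "low" decision even though the SVM preferred "medium"; with zero attention weights, `attention_pool` degenerates to plain mean pooling over frames.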