Self-Assessed Affect Recognition Using Fusion of Attentional BLSTM and Static Acoustic Features

In this study, we present a computational framework for the Self-Assessed Affect Sub-Challenge of the INTERSPEECH 2018 Computational Paralinguistics Challenge. The goal of this sub-challenge is to classify the valence scores reported by the speakers themselves into three levels: low, medium, and high. We explore fusing bi-directional LSTM (BLSTM) models with the baseline SVM models to improve recognition accuracy. Specifically, we extract frame-level acoustic low-level descriptors (LLDs) as input to a BLSTM with a modified attention mechanism, and train separate SVMs on the standard ComParE 2016 baseline feature set with minority-class upsampling. The predictions of these diverse models are then combined with a decision-level score fusion scheme. Our proposed approach achieves 62.94% and 67.04% unweighted average recall (UAR), absolute improvements of 6.24% and 1.04% over the best baselines provided by the challenge organizers. We further provide a detailed comparative analysis of the different models.
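The two core components of the pipeline can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: `attention_pool` shows a generic soft-attention pooling over BLSTM frame outputs (the paper's "modified" attention is not specified here, so a plain dot-product attention with a hypothetical weight vector `w` is used), and `fuse_scores` shows decision-level score fusion as a weighted average of per-class posteriors from the BLSTM and SVM models (the actual fusion weights are an assumption).

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Soft-attention pooling over frame-level features.
    frame_feats: (n_frames, dim) BLSTM outputs; w: (dim,) attention weights.
    Returns a single (dim,) utterance-level vector."""
    logits = frame_feats @ w                      # per-frame attention logits
    alpha = np.exp(logits - logits.max())         # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha @ frame_feats                    # attention-weighted sum

def fuse_scores(score_lists, weights=None):
    """Decision-level score fusion: weighted average of per-class
    posteriors from several models, then argmax per sample."""
    scores = np.stack(score_lists)                # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(score_lists))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, scores, axes=1) # (n_samples, n_classes)
    return fused.argmax(axis=1)

# Toy posteriors for 2 utterances over the 3 classes (low / medium / high)
blstm = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
svm   = np.array([[0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])
preds = fuse_scores([blstm, svm], weights=[0.5, 0.5])  # -> array([0, 2])
```

With equal weights the fused posteriors for the first utterance are `[0.45, 0.4, 0.15]`, so fusion keeps the BLSTM's "low" decision even though the SVM preferred "medium"; with zero attention weights, `attention_pool` degenerates to plain mean pooling over frames.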