Multi-Head Attention-Based Long Short-Term Memory for Depression Detection From Speech

Depression is a mental disorder that threatens people's health and daily life, so effective methods for detecting it are essential. However, research on depression detection has mainly focused on combining parallel features from audio, video, and text to boost performance, without making full use of the information inherent in speech itself. To focus on the more emotionally salient regions of depressed speech, we propose a multi-head time-dimension attention-based long short-term memory (LSTM) model. We first extract frame-level features, which preserve the original temporal structure of a speech sequence, and analyze how they differ between depressed and healthy speakers. We then compare the performance of various features and use a modified feature set as the input to the LSTM layer. Instead of using the output of a conventional LSTM directly, multi-head attention over the time dimension projects that output into different subspaces to capture the time steps most relevant to depression detection. Experimental results show that the proposed model improves on the LSTM baseline by 2.3% and 10.3% on the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) and the Multi-modal Open Dataset for Mental-disorder Analysis (MODMA) corpus, respectively.
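The following is a minimal PyTorch sketch of the architecture the abstract describes: an LSTM over frame-level speech features whose hidden states are re-weighted by multi-head self-attention along the time dimension, then pooled for binary depression classification. The layer sizes, number of heads, feature dimension, and mean pooling are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class MultiHeadTimeAttentionLSTM(nn.Module):
    """Sketch of a multi-head time-dimension attention-based LSTM classifier.

    Hyperparameters (feat_dim, hidden_dim, num_heads) are assumed values,
    not taken from the paper.
    """

    def __init__(self, feat_dim=130, hidden_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        # The LSTM preserves the original temporal relationship of the frames.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Multi-head attention projects the hidden states into several
        # subspaces and attends over the time dimension in each of them.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level acoustic features.
        h, _ = self.lstm(x)               # (batch, time, hidden_dim)
        # Self-attention over time: queries, keys, and values are all h,
        # so each frame is re-weighted by its relevance to the others.
        attended, _ = self.attn(h, h, h)  # (batch, time, hidden_dim)
        # Mean-pool the attended states over time (an assumed design choice).
        pooled = attended.mean(dim=1)     # (batch, hidden_dim)
        return self.classifier(pooled)    # (batch, num_classes) logits


# Usage on a dummy batch of 8 utterances, 300 frames each:
model = MultiHeadTimeAttentionLSTM()
logits = model(torch.randn(8, 300, 130))
print(logits.shape)  # torch.Size([8, 2])
```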
