AudiBERT: A Deep Transfer Learning Multimodal Classification Framework for Depression Screening

Depression is a leading cause of disability with tremendous socioeconomic costs. Although early detection is crucial to improving prognosis, this mental illness remains largely undiagnosed. Depression classification from voice holds the promise of revolutionizing diagnosis by ubiquitously integrating this screening capability into virtual assistants and smartphone technologies. Unfortunately, due to privacy concerns, audio datasets with depression labels contain only small numbers of participants, causing current classification models to suffer from low performance. To tackle this challenge, we introduce Audio-Assisted BERT (AudiBERT), a novel deep learning framework that leverages the multimodal nature of human voice. To alleviate the small-data problem, AudiBERT integrates pretrained audio and text representation models for the respective modalities, augmented by a dual self-attention mechanism, into a deep learning architecture. Applied to depression classification, AudiBERT consistently achieves promising performance, improving F1 scores by 6% to 30% over state-of-the-art audio and text models across 15 thematic question datasets. Using answers to medically targeted and general wellness questions, our framework achieves F1 scores of up to 0.92 and 0.86, respectively, demonstrating the feasibility of depression screening from informal dialogue using voice-enabled technologies.
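The abstract does not spell out the architecture, but its description suggests a design along these lines: a pretrained speech encoder (e.g., wav2vec 2.0) and a pretrained text encoder (BERT), each collapsed into a fixed-size vector by its own self-attention pooling head (the dual self-attention mechanism), with the two pooled representations fused for classification. The sketch below is a minimal PyTorch illustration under those assumptions; the checkpoint names, layer sizes, fusion by concatenation, and classifier head are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of an AudiBERT-style multimodal classifier.
# Assumptions (not taken from the paper): wav2vec2-base and bert-base
# encoders, concatenation fusion, and a small MLP classifier head.
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model


class SelfAttentionPool(nn.Module):
    """Collapses a sequence of hidden states into one vector using
    learned attention weights over the time/token dimension."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.score(hidden_states), dim=1)
        return (weights * hidden_states).sum(dim=1)


class AudiBERTSketch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.audio_encoder = Wav2Vec2Model.from_pretrained(
            "facebook/wav2vec2-base-960h")
        text_dim = self.text_encoder.config.hidden_size    # 768
        audio_dim = self.audio_encoder.config.hidden_size  # 768
        # "Dual" self-attention: one pooling head per modality.
        self.text_pool = SelfAttentionPool(text_dim)
        self.audio_pool = SelfAttentionPool(audio_dim)
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, audio_values):
        # Contextual token embeddings from the text transcript.
        text_states = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Frame-level speech representations from the raw waveform.
        audio_states = self.audio_encoder(audio_values).last_hidden_state
        fused = torch.cat(
            [self.text_pool(text_states), self.audio_pool(audio_states)],
            dim=-1)
        return self.classifier(fused)
```

One plausible reading of why such a design helps with small labeled datasets: both encoders arrive pretrained on large unlabeled corpora, so only the lightweight pooling heads and classifier must be learned from the scarce depression-labeled data, while each attention head can learn which frames or tokens are most indicative for its modality.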
