Predicting Mood Disorder Symptoms with Remotely Collected Videos Using an Interpretable Multimodal Dynamic Attention Fusion Network

We developed a novel, interpretable multimodal classification method to identify symptoms of mood disorders, namely depression, anxiety, and anhedonia, from audio, video, and text collected through a smartphone application. We used CNN-based unimodal encoders to learn dynamic embeddings for each modality and then combined these embeddings with a transformer encoder. We applied these methods to a new dataset of 3,002 participants, each recorded in up to three sessions on the smartphone application. Our method achieved better multimodal classification performance than existing methods that rely on static embeddings. Lastly, we used SHapley Additive exPlanations (SHAP) to prioritize the features most important to our model's predictions, which could serve as potential digital markers.
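The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of how such a pipeline can be wired together, not the authors' implementation. All layer sizes, feature dimensions, the number of symptoms, and the mean-pooling step are illustrative assumptions.

```python
# Sketch of a multimodal dynamic fusion network: per-modality 1D CNN encoders
# produce sequences of dynamic embeddings, which a transformer encoder fuses
# before classification. Hyperparameters and dimensions are assumptions.
import torch
import torch.nn as nn


class UnimodalCNNEncoder(nn.Module):
    """Maps a (batch, features, time) stream to (batch, time, d_model) embeddings."""

    def __init__(self, in_features: int, d_model: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x).transpose(1, 2)  # (batch, time, d_model)


class MultimodalFusionClassifier(nn.Module):
    """Concatenates per-modality embedding sequences along the time axis,
    fuses them with a transformer encoder, then pools for classification."""

    def __init__(self, audio_dim: int, video_dim: int, text_dim: int,
                 d_model: int = 128, n_heads: int = 4, n_layers: int = 2,
                 n_symptoms: int = 3):
        super().__init__()
        self.audio_enc = UnimodalCNNEncoder(audio_dim, d_model)
        self.video_enc = UnimodalCNNEncoder(video_dim, d_model)
        self.text_enc = UnimodalCNNEncoder(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per symptom: depression, anxiety, anhedonia (assumed).
        self.head = nn.Linear(d_model, n_symptoms)

    def forward(self, audio, video, text):
        tokens = torch.cat(
            [self.audio_enc(audio), self.video_enc(video), self.text_enc(text)],
            dim=1,  # join modalities into one token sequence for self-attention
        )
        fused = self.fusion(tokens)
        return self.head(fused.mean(dim=1))  # mean-pool over time


model = MultimodalFusionClassifier(audio_dim=40, video_dim=17, text_dim=300)
logits = model(torch.randn(2, 40, 100),   # e.g. 40 acoustic features, 100 frames
               torch.randn(2, 17, 100),   # e.g. 17 facial features, 100 frames
               torch.randn(2, 300, 50))   # e.g. 300-d word embeddings, 50 tokens
print(logits.shape)  # torch.Size([2, 3])
```

A gradient-based SHAP explainer (e.g., `shap.GradientExplainer`) could then be applied to a trained model of this form to attribute each prediction to input features, which is one plausible route to the per-feature importances the abstract describes.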
