Multimodal and Multiresolution Depression Detection from Speech and Facial Landmark Features

Automatic classification of depression from audiovisual cues can support its objective diagnosis. In this paper, we present a multimodal depression classification system developed as part of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC 2016). We investigate a number of audio and video features for classification, together with different fusion techniques and temporal contexts. In the audio modality, Teager energy cepstral coefficients (TECC) outperform the standard baseline features, while the best accuracy is achieved with i-vector modelling based on MFCC features. In the video modality, polynomial parameterization of facial landmark features achieves the best performance among all systems and also outperforms the best baseline system.
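The two feature ideas named above can be made concrete with a short sketch. This is a minimal illustration rather than the authors' implementation: it shows (a) the discrete Teager energy operator that underlies TECC extraction and (b) a least-squares polynomial fit of a single facial-landmark trajectory over a temporal window, returning the coefficients as features. The polynomial order, window length, and the per-coordinate layout of the landmark track are assumptions made only for illustration.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator (Kaiser):
    psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate boundary values
    return psi

def landmark_poly_features(trajectory, order=3):
    """Fit a polynomial of the given order to one landmark coordinate
    over a window of video frames; the coefficients serve as features.
    The order (3) is an illustrative choice, not the paper's setting."""
    t = np.linspace(0.0, 1.0, num=len(trajectory))  # normalized time axis
    return np.polyfit(t, np.asarray(trajectory, dtype=float), deg=order)

if __name__ == "__main__":
    # Toy usage: Teager energy of a sinusoid, and a cubic fit to a
    # synthetic x-coordinate track of one landmark over 100 frames.
    signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1600))
    print(teager_energy(signal)[:5])
    fake_track = 120 + 3 * np.sin(np.linspace(0, np.pi, 100))
    print(landmark_poly_features(fake_track, order=3))
```

In a full TECC pipeline the Teager energy output would further pass through a filterbank and cepstral (DCT) stage, and the landmark polynomial coefficients would be pooled over all landmarks and windows before classification; those stages are omitted here for brevity.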
