Model Fusion for Multimodal Depression Classification and Level Detection

Audio-visual cues of emotion and mood disorders have recently been explored to develop tools that assist psychologists and psychiatrists in evaluating a patient's level of depression. In this paper, we present several multimodal depression level predictors based on a model fusion approach, in the context of the AVEC 2014 challenge. We show that an i-vector representation of short-term audio features carries useful information for depression classification and prediction. We also employ a classification step prior to regression, so that different regression models can be applied depending on the predicted presence or absence of depression. Our experiments show that combining our audio-based model with two other models based on LGBP-TOP video features leads to an improvement of 4% over the baseline model proposed by the challenge organizers.
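The abstract does not give implementation details, so the following is a minimal sketch of the classify-then-regress idea it describes, assuming precomputed feature vectors (e.g. i-vectors) in X, depression scores in y, and SVM-based models as illustrative stand-ins for the predictors used in the paper; the score threshold and kernel choices are assumptions, not the authors' configuration.

```python
# Hypothetical sketch: a binary depression classifier routes each sample to a
# class-specific regressor for depression level prediction. Feature extraction
# (i-vectors, LGBP-TOP) is assumed to have been done beforehand.
import numpy as np
from sklearn.svm import SVC, SVR

def fit_classify_then_regress(X, y, threshold=13):
    """Fit a depressed/non-depressed classifier, then one regressor per class.

    The threshold separating the two groups is an illustrative assumption.
    """
    labels = (y >= threshold).astype(int)        # presence/absence of depression
    clf = SVC(kernel="linear").fit(X, labels)    # classification step
    regressors = {}
    for c in (0, 1):                             # separate regression model per class
        mask = labels == c
        regressors[c] = SVR(kernel="rbf").fit(X[mask], y[mask])
    return clf, regressors

def predict_level(clf, regressors, X):
    """Route each sample to the regressor matching its predicted class."""
    classes = clf.predict(X)
    return np.array([regressors[c].predict(x[None, :])[0]
                     for c, x in zip(classes, X)])
```

In this setup the classifier decides which regression model to trust, which mirrors the paper's motivation for placing classification before regression: level prediction can then specialize to each sub-population rather than fitting a single model across both.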
