Assessing speaker independence on a speech-based depression level estimation system

- Depression can be considered a soft biometric trait related to psychological state
- The speaker dependence of an iVector-based depression level estimation system is assessed
- System performance is much better when the test speaker is in the training set
- Experimental frameworks must be carefully designed to avoid biasing the experiments
- We introduce a new metric for assessing depression classification systems

Soft biometrics refers to traits that provide valuable information about an individual but are not sufficient for authentication, because they lack uniqueness and distinctiveness. This definition includes features related to an individual's psychological state, such as emotions or mental health disorders like depression. Depression has recently attracted the attention of speech researchers, and the Audio/Visual Emotion Challenges (AVEC) 2013 and 2014 were organized to encourage the development of approaches that accurately estimate a speaker's depression level. The evaluation frameworks provided for these challenges do not take speaker independence into account in their experimental design, even though it is an important factor in developing a robust speech-based system. We assess the influence of prior knowledge of the speakers on a depression estimation experiment by using a state-of-the-art iVector-based approach to depression level estimation in both a speaker-dependent and a speaker-independent experiment. We conclude that prior information about the depression level of a given speaker dramatically improves system performance. Hence, we suggest that experimental frameworks must be carefully designed if they are to serve as a genuinely useful resource for developing robust depression estimation systems.
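The difference between the speaker-dependent and speaker-independent experiments comes down to how recordings are partitioned into training and test sets. Below is a minimal Python sketch of the two partitioning schemes. It assumes each recording is a dict with a hypothetical `speaker` identifier. The function names are illustrative and not taken from the paper, and the feature extraction, iVector model and regressor are deliberately left out.

```python
import random
from collections import defaultdict


def speaker_dependent_split(recordings, test_fraction=0.3, seed=0):
    """Shuffle recordings and split them; a speaker may land in both sets."""
    assert 0 < test_fraction < 1
    rng = random.Random(seed)
    shuffled = list(recordings)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def speaker_independent_split(recordings, test_fraction=0.3, seed=0):
    """Split by speaker so that all recordings of a speaker stay on one side."""
    assert 0 < test_fraction < 1
    by_speaker = defaultdict(list)
    for rec in recordings:
        by_speaker[rec["speaker"]].append(rec)
    speakers = sorted(by_speaker)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in recordings if r["speaker"] not in test_speakers]
    test = [r for r in recordings if r["speaker"] in test_speakers]
    return train, test


def shared_speakers(train, test):
    """Speakers present in both partitions (empty for a speaker-independent setup)."""
    return {r["speaker"] for r in train} & {r["speaker"] for r in test}


# Toy example: 10 hypothetical speakers with 3 sessions each (no real data).
recordings = [{"speaker": f"spk{s}", "session": k} for s in range(10) for k in range(3)]
train_sd, test_sd = speaker_dependent_split(recordings)
train_si, test_si = speaker_independent_split(recordings)
print("speaker-dependent overlap:", len(shared_speakers(train_sd, test_sd)))
print("speaker-independent overlap:", len(shared_speakers(train_si, test_si)))  # always 0
```

Reporting the `shared_speakers` check alongside results shows explicitly whether test speakers were seen during training. The abstract identifies this factor as one that inflates performance.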
