Natural Language Processing Methods for Acoustic and Landmark Event-Based Features in Speech-Based Depression Detection

The processing of speech as an explicit sequence of events is common in automatic speech recognition (linguistic events), but has received relatively little attention in paralinguistic speech classification, despite its potential for characterizing broad acoustic event sequences. This paper proposes a framework for analyzing speech as a sequence of acoustic events and investigates its application to depression detection. In this framework, regions of the acoustic space are tokenized into ‘words’ that represent speech events at fixed or irregular intervals. This tokenization allows acoustic word features to be exploited using proven natural language processing methods. A key advantage of the framework is its ability to accommodate heterogeneous event types: herein we combine acoustic words and speech landmarks, which are articulation-related speech events. Another advantage is the option to fuse such heterogeneous events at various levels, including the embedding level. Evaluation on both controlled, laboratory-grade, supervised audio recordings and unsupervised, self-administered smartphone recordings highlights the merits of the proposed framework across both datasets, with the proposed landmark-dependent acoustic words achieving improvements in F1(depressed) of up to 15% and 13% for SH2-FS and DAIC-WOZ, respectively, relative to acoustic speech baseline approaches.
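
The tokenization idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a pre-trained codebook of centroids (hand-written here), maps each acoustic frame to its nearest centroid to obtain an acoustic 'word', and counts word bigrams as a simple text-style feature. The function names, the toy two-dimensional frames, the `w<k>` token naming, and the way a landmark label is prefixed to the acoustic word at the landmark's frame index are all assumptions made for illustration; the abstract does not specify how landmark-dependent acoustic words are constructed.

```python
from collections import Counter
import math


def nearest_codeword(frame, codebook):
    """Index of the codebook centroid closest to `frame` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))


def tokenize(frames, codebook):
    """Map each frame to an acoustic 'word' token such as 'w3' (hypothetical naming)."""
    return [f"w{nearest_codeword(frame, codebook)}" for frame in frames]


def ngram_counts(tokens, n=2):
    """Count n-grams over a token sequence, as one would for text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def landmark_dependent(tokens, landmarks):
    """ASSUMPTION: pair each landmark label with the acoustic word at its frame index.

    `landmarks` is a list of (frame_index, label) tuples; the labels used below
    are illustrative placeholders.
    """
    return [f"{label}_{tokens[i]}" for i, label in landmarks if 0 <= i < len(tokens)]


# Toy data: a 3-centroid codebook and five 2-D frames (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.8, 0.2), (0.2, 0.1)]

tokens = tokenize(frames, codebook)
bigrams = ngram_counts(tokens, n=2)
lm_tokens = landmark_dependent(tokens, [(1, "g+"), (3, "g-")])

print(tokens)
print(bigrams)
print(lm_tokens)
```

Once speech is represented as token sequences like these, count-based, topic-model, or embedding-based text features can be computed over them in the same way as for ordinary words.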
