Unsupervised Speech Representation Learning for Behavior Modeling using Triplet Enhanced Contextualized Networks

Abstract

Speech encodes a wealth of information related to human behavior and has been used in a variety of automated behavior recognition tasks. However, extracting behavioral information from speech remains challenging, in part because of inadequate training data stemming from the often low occurrence frequency of specific behavioral patterns. Moreover, supervised behavioral modeling typically relies on domain-specific construct definitions and corresponding manually annotated data, making generalization across domains difficult. In this paper, we exploit the stationary properties of human behavior within an interaction and present a representation learning method that captures behavioral information from speech in an unsupervised way. We hypothesize that nearby segments of speech share the same behavioral context and hence map onto similar underlying behavioral representations. We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework to capture behavioral context and derive a manifold representation in which speech frames with similar behaviors lie closer together while frames with different behaviors remain farther apart. The models are trained on movie audio data and validated on diverse domains, including a couples therapy corpus and other publicly collected data (e.g., stand-up comedy). With encouraging results, the proposed framework demonstrates the feasibility of unsupervised learning for cross-domain behavioral modeling.
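The abstract does not specify the exact TE-DCN objective, but the stated hypothesis — nearby speech segments share a behavioral context and should map to nearby embeddings — is naturally expressed with a standard triplet margin loss, where a temporally adjacent frame serves as the positive and a distant frame as the negative. The sketch below is a minimal illustration under those assumptions; the embedding dimension, margin value, and toy vectors are hypothetical, not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors:
    pull the anchor toward the positive (assumed same behavioral
    context) and push it away from the negative (assumed different
    context), up to the given margin."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)

# Toy 3-dim embeddings of three speech frames (illustrative values only).
anchor   = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])   # temporally nearby frame
negative = np.array([0.0, 1.0, 0.0])   # frame from a distant segment

loss = triplet_loss(anchor, positive, negative)  # 0.0: already well separated
```

In this toy case the anchor-positive distance plus the margin is already smaller than the anchor-negative distance, so the loss is zero; swapping the positive and negative roles yields a positive loss that would drive the embeddings apart during training.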
