Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations

Spoken language understanding tasks usually rely on pipelines involving complex processing blocks such as voice activity detection, speaker diarization and Automatic speech recognition (ASR). We propose a novel framework for predicting utterance level labels directly from speech features, thus removing the dependency on first generating transcripts, and transcription free behavioral coding. Our classifier uses a pretrained Speech-2-Vector encoder as bottleneck to generate word-level representations from speech features. This pretrained encoder learns to encode speech features for a word using an objective similar to Word2Vec. Our proposed approach just uses speech features and word segmentation information for predicting spoken utterance-level target labels. We show that our model achieves competitive results to other state-of-the-art approaches which use transcribed text for the task of predicting psychotherapy-relevant behavior codes.

[1]  Hagen Soltau,et al.  Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[2]  Andreas Stolcke,et al.  Recurrent neural network and LSTM models for lexical utterance classification , 2015, INTERSPEECH.

[3]  Shrikanth Narayanan,et al.  Improving the Prediction of Therapist Behaviors in Addiction Counseling by Exploiting Class Confusions , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Verónica Pérez-Rosas,et al.  Predicting Counselor Behaviors in Motivational Interviewing Encounters , 2017, EACL.

[5]  Vivek Srikumar,et al.  Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes , 2019, ACL.

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Yongqiang Wang,et al.  Towards End-to-end Spoken Language Understanding , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Prateek Verma,et al.  Audio-linguistic Embeddings for Spoken Sentences , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Keikichi Hirose,et al.  Prosodic word boundary detection using statistical modeling of moraic fundamental frequency contours and its use for continuous speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[10]  James R. Glass,et al.  Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech , 2018, INTERSPEECH.

[11]  Christopher Joseph Pal,et al.  Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning , 2018, ICLR.

[12]  Jean-Claude Junqua,et al.  A robust algorithm for word boundary detection in the presence of noise , 1994, IEEE Trans. Speech Audio Process..

[13]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[14]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[15]  Shrikanth S. Narayanan,et al.  Using Prosodic and Lexical Information for Learning Utterance-level Behaviors in Psychotherapy , 2018, INTERSPEECH.

[16]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[17]  Panayiotis G. Georgiou,et al.  Robust word boundary detection in spontaneous speech using acoustic and lexical cues , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Ruhi Sarikaya,et al.  Contextual domain classification in spoken language understanding systems using recurrent neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  David C. Atkins,et al.  A Comparison of Natural Language Processing Methods for Automated Coding of Motivational Interviewing. , 2016, Journal of substance abuse treatment.

[21]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[22]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Panayiotis G. Georgiou,et al.  Behavioral Coding of Therapist Language in Addiction Counseling Using Recurrent Neural Networks , 2016, INTERSPEECH.

[24]  Gokhan Tur,et al.  Spoken Language Understanding: Systems for Extracting Semantic Information from Speech , 2011 .

[25]  Yoshua Bengio,et al.  Speech Model Pre-training for End-to-End Spoken Language Understanding , 2019, INTERSPEECH.

[26]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[27]  David Suendermann-Oeft,et al.  Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[28]  Panayiotis G. Georgiou,et al.  Multi-Label Multi-Task Deep Learning for Behavioral Coding , 2018, IEEE Transactions on Affective Computing.

[29]  Geoffrey Zweig,et al.  Recurrent neural networks for language understanding , 2013, INTERSPEECH.