Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks

Entrainment is a known adaptation mechanism that causes interaction participants to adapt or synchronize their acoustic characteristics. Understanding how interlocutors adapt to each other's speaking style through entrainment involves measuring a range of acoustic features and comparing them using multiple signal comparison methods. In this work, we present a turn-level distance measure obtained in an unsupervised manner using a Deep Neural Network (DNN) model, which we call Neural Entrainment Distance (NED). This metric establishes a framework that learns an embedding from the population-wide entrainment in an unlabeled training corpus. We apply the framework to a set of acoustic features and validate the measure experimentally by showing its efficacy in distinguishing real conversations from fake ones created by randomly shuffling speaker turns. Moreover, we show real-world evidence of the validity of the proposed measure. We find that a high value of NED is associated with high ratings of emotional bond in suicide assessment interviews, which is consistent with prior studies.
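
The real-versus-fake validation described above can be illustrated with a minimal sketch. It is not the NED model: plain Euclidean distance on raw feature vectors stands in for the learned DNN embedding distance, the data are synthetic, and all function names (`turn_distance`, `mean_adjacent_distance`, `make_fake_conversation`, `real_beats_fake_rate`) are hypothetical. The sketch only shows the shape of the test, in which a useful turn-level distance should be lower on average for real adjacent turns than for randomly shuffled ones.

```python
import math
import random


def turn_distance(x, y):
    """Euclidean distance between two turn-level feature vectors.

    Assumption: this is a stand-in for a learned distance such as NED.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_adjacent_distance(turns_a, turns_b):
    """Average distance over paired consecutive turns (A's turn, B's reply)."""
    pairs = list(zip(turns_a, turns_b))
    return sum(turn_distance(x, y) for x, y in pairs) / len(pairs)


def make_fake_conversation(turns_b, rng):
    """Shuffle one speaker's turns to break the real turn-to-turn pairing."""
    shuffled = list(turns_b)
    rng.shuffle(shuffled)
    return shuffled


def real_beats_fake_rate(turns_a, turns_b, n_trials=200, seed=0):
    """Fraction of shuffles in which the real pairing has the lower mean distance."""
    rng = random.Random(seed)
    real = mean_adjacent_distance(turns_a, turns_b)
    wins = 0
    for _ in range(n_trials):
        fake_b = make_fake_conversation(turns_b, rng)
        if real < mean_adjacent_distance(turns_a, fake_b):
            wins += 1
    return wins / n_trials


# Synthetic toy data (not from any corpus): speaker B loosely tracks speaker A.
rng = random.Random(42)
turns_a = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(30)]
turns_b = [[a + rng.gauss(0, 0.5) for a in turn] for turn in turns_a]
rate = real_beats_fake_rate(turns_a, turns_b)
print(f"real pairing closer than shuffled in {rate:.0%} of trials")
```

In this toy setup the second speaker's features are constructed to follow the first speaker's, so the real pairing is expected to win in nearly every trial; with real acoustic features the separation would depend on how well the chosen distance captures entrainment.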
