Neural Representations of Dialogical History for Improving Upcoming Turn Acoustic Parameters Prediction

Predicting the acoustic and linguistic parameters of an upcoming conversational turn is important for dialogue systems aiming at low-level adaptation to the user. It is known that during an interaction speakers can influence each other's speech production. However, the precise dynamics of this phenomenon are not well established, especially in the context of natural conversations. We developed a model based on an RNN architecture that predicts speech variables (energy, F0 range, and speech rate) of the upcoming turn from a representation vector encoding speech information of previous turns. We compare prediction performance when using a dialogical history (turns from both participants) vs. a monological history (turns from only the upcoming turn's speaker). We found that the information contained in previous turns produced by both the speaker and their interlocutor reduces the error in predicting the current acoustic target variable. In addition, the prediction error decreases as the number of previous turns taken into account increases.
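The abstract describes an RNN that consumes a sequence of per-turn representation vectors and outputs the acoustic parameters of the next turn. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the dimensions, the vanilla-RNN cell, and the random weights (standing in for trained parameters) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each turn is summarized by a feature
# vector, and the model predicts 3 acoustic targets
# (energy, F0 range, speech rate).
TURN_DIM, HIDDEN_DIM, N_TARGETS = 8, 16, 3

# Randomly initialized weights stand in for trained parameters.
W_in = rng.normal(scale=0.1, size=(HIDDEN_DIM, TURN_DIM))
W_h = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
W_out = rng.normal(scale=0.1, size=(N_TARGETS, HIDDEN_DIM))

def predict_next_turn(history):
    """Run a vanilla RNN over the turn history (oldest first) and
    predict the acoustic parameters of the upcoming turn."""
    h = np.zeros(HIDDEN_DIM)
    for turn in history:
        h = np.tanh(W_in @ turn + W_h @ h)
    return W_out @ h  # (energy, F0 range, speech rate)

# Dialogical history: turn vectors from both participants, interleaved
# in conversational order; a monological history would keep only the
# upcoming speaker's turns.
history = [rng.normal(size=TURN_DIM) for _ in range(6)]
prediction = predict_next_turn(history)
print(prediction.shape)  # (3,)
```

Comparing dialogical vs. monological conditions then amounts to filtering which turns enter `history`, and varying its length probes how prediction error changes with the number of previous turns considered.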
