Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs

To converse fluidly with users, spoken dialog systems must be sensitive to the turn-taking cues that users produce, and their models must support effective decisions about when the system should or should not speak. Traditional end-of-turn models, which make decisions only at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking continuously using RNNs: at each time step, the system predicts speech probability scores for the discrete frames within a future window. These continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to decisions that are not limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
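
As a rough illustration of the continuous formulation described above, the sketch below builds per-frame prediction targets from a binary voice-activity track: for each frame, the target is the speech activity of the frames in a fixed future window. The function name `future_window_targets`, the toy data, and the window length of 3 frames are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import List


def future_window_targets(voice_activity: List[int], window: int) -> List[List[int]]:
    """For each frame t, return the speech activity of frames t+1 .. t+window.

    Frames too close to the end to have a full future window are skipped.
    (Hypothetical helper for illustration; not the paper's implementation.)
    """
    if window <= 0:
        raise ValueError("window must be positive")
    targets = []
    for t in range(len(voice_activity) - window):
        # Slice out the next `window` frames after frame t.
        targets.append(voice_activity[t + 1 : t + 1 + window])
    return targets


# Toy voice-activity track for one speaker: 1 = speaking, 0 = silent.
vad = [1, 1, 0, 0, 1, 1, 1]
targets = future_window_targets(vad, window=3)
for t, target in enumerate(targets):
    print(f"frame {t}: future window {target}")
```

A recurrent model trained against targets of this shape outputs one probability per future frame at every time step, which is what allows the same predictions to be reused for decisions at pauses and at overlaps.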
