Paper information - Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs

For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
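As a rough illustration of how continuous predictions could be turned into a decision at a pause, the sketch below compares the mean predicted speech probability of two speakers over the future window. The decision rule, the function name `decide_at_pause`, and the probability values are assumptions made for illustration only; they are not taken from the abstract and do not reproduce the authors' model or evaluation.

```python
from statistics import mean


def decide_at_pause(user_probs, system_probs):
    """Return "hold" if the user is predicted to keep the turn, else "shift".

    Each argument is a list of predicted speech probabilities, one per
    discrete frame in the future window (hypothetical model outputs).
    The mean-comparison rule is an assumption for this sketch.
    """
    user_score = mean(user_probs)
    system_score = mean(system_probs)
    return "hold" if user_score >= system_score else "shift"


# Hypothetical per-frame predictions over a six-frame future window.
user_window = [0.6, 0.5, 0.3, 0.2, 0.1, 0.1]
system_window = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]

decision = decide_at_pause(user_window, system_window)
print(decision)
```

Because the same per-frame predictions are available at every time step, a similar comparison could in principle be made at overlaps or other decision points, not only at the end of a turn.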

Gabriel Skantze | Naomi Harte | Matthew Roddy
