Turn-taking and affirmative cue words in task-oriented dialogue

As interactive voice response systems spread rapidly and offer increasingly complex functionality, it is becoming clear that their challenges are not solely associated with their synthesis and recognition capabilities. Rather, issues such as the coordination of turn exchanges between system and user, or the correct generation and understanding of words that may convey multiple meanings, appear to play an important role in system usability. This thesis explores those two issues in the Columbia Games Corpus, a collection of spontaneous task-oriented dialogues in Standard American English. We provide evidence of the existence of seven turn-yielding cues (prosodic, acoustic and syntactic events strongly associated with conversational turn endings) and show that the likelihood of a turn-taking attempt from the interlocutor increases linearly with the number of cues conjointly displayed by the speaker. We present similar results for six backchannel-inviting cues, events that invite the interlocutor to produce a short utterance conveying continued attention. Additionally, we describe a series of studies of affirmative cue words, a family of cue words such as okay or alright that speakers use frequently in conversation for several purposes, including acknowledging what the interlocutor has said and cueing the start of a new topic. We find differences in the acoustic/prosodic realization of such functions, but observe that contextual information figures prominently in human disambiguation of these words. We also conduct machine learning experiments to explore the automatic classification of affirmative cue words. Finally, we examine a novel measure of speaker entrainment related to the usage of these words, showing its association with task success and dialogue coordination.
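
The claim that the likelihood of a turn-taking attempt grows linearly with the number of cues displayed can be illustrated with a minimal sketch. It assumes each observation is a pair of a cue count and whether the interlocutor attempted to take the turn. The function names and the toy observations below are hypothetical and made up for illustration; they are not the thesis's method or data from the Columbia Games Corpus.

```python
from collections import defaultdict


def attempt_rate_by_cue_count(observations):
    """Map each cue count to the fraction of observations followed by an attempt.

    observations: iterable of (num_cues, attempted) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    attempts = defaultdict(int)
    for num_cues, attempted in observations:
        totals[num_cues] += 1
        attempts[num_cues] += int(attempted)
    return {k: attempts[k] / totals[k] for k in sorted(totals)}


def least_squares_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up toy observations, NOT corpus data: (number of cues, attempt made?)
toy = [(0, False), (0, False), (1, False), (1, True), (2, True), (2, True)]

rates = attempt_rate_by_cue_count(toy)
slope, intercept = least_squares_line(list(rates.items()))
print(rates, slope, intercept)
```

A positive slope with small residuals around the fitted line would be consistent with a linear relationship; assessing that on real data would also call for a goodness-of-fit measure, which this sketch leaves out.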
