Jointly recognizing multi-speaker conversations

We propose an approach to speech recognition in which the multiple sides of a conversation in a dialog or meeting are processed and decoded jointly rather than independently. We also introduce a practical implementation of this approach that yields improvements in both language model perplexity and speech recognition word error rate on conversational telephone speech. Specifically, we show that such benefits can be obtained if an n-gram language model, in addition to conditioning on the immediately preceding words in an utterance, is also allowed to condition on the estimated dialog act of the immediately preceding utterance of an alternate speaker.
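The core idea, a language model that conditions on both the previous words and the other speaker's estimated dialog act, can be sketched as a small interpolated bigram model. This is an illustrative sketch only: the class, the act labels (QUESTION, STATEMENT), and the mixing weight `lam` are hypothetical and not from the paper, which uses a full n-gram/factored-LM implementation.

```python
from collections import defaultdict

class DialogActLM:
    """Bigram LM that additionally conditions on the dialog act
    estimated for the other speaker's preceding utterance.
    Names and smoothing scheme are illustrative, not the paper's."""

    def __init__(self):
        # (dialog_act, prev_word) -> {word: count}
        self.counts = defaultdict(lambda: defaultdict(int))
        # prev_word -> {word: count}  (act-independent backoff table)
        self.backoff = defaultdict(lambda: defaultdict(int))

    def train(self, utterances):
        # utterances: list of (dialog_act, word_list) pairs
        for act, words in utterances:
            toks = ["<s>"] + words + ["</s>"]
            for prev, cur in zip(toks, toks[1:]):
                self.counts[(act, prev)][cur] += 1
                self.backoff[prev][cur] += 1

    def prob(self, word, prev_word, act, lam=0.7):
        # Linearly interpolate the act-conditioned bigram with the
        # plain bigram; lam is a hypothetical mixing weight.
        def mle(table, key):
            total = sum(table[key].values())
            return table[key][word] / total if total else 0.0
        return lam * mle(self.counts, (act, prev_word)) \
            + (1 - lam) * mle(self.backoff, prev_word)
```

For example, after training on a QUESTION utterance "yes" and a STATEMENT utterance "no", the model assigns a higher probability to "yes" opening an utterance when the preceding dialog act was a question than when it was a statement, which is the kind of cross-speaker dependency the paper exploits.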
