Segmentation and disfluency removal for conversational speech translation

In this paper we focus on the effect of on-line speech segmentation and disfluency removal methods on conversational speech translation. In a real-time conversational speech-to-speech translation system, on-line segmentation of speech is required to keep latency from exceeding a few seconds. While sentential unit segmentation and disfluency removal have been heavily studied, mainly for off-line speech processing, to the best of our knowledge the combined effect of these tasks on conversational speech translation has not been investigated. Furthermore, optimizing performance under the maximum system latency that still permits a conversation is a newer problem for these tasks. We show that the conventional ordering, segmentation followed by disfluency removal, is not the best practice. We propose a new approach that performs simple-disfluency removal first, then segmentation, and then complex-disfluency removal. The proposed approach yields a significant gain in translation performance of up to 3 BLEU points with a look-ahead latency of only 6 seconds, using state-of-the-art machine translation and speech recognition systems.

Index Terms: speech translation, disfluency removal, segmentation, sentence units, speech processing
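
The three-stage ordering described in the abstract (simple-disfluency removal, then segmentation, then complex-disfluency removal) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the method used in the paper: the filler list, the `<pause>` token, the `max_len` cut-off standing in for a latency budget, the "i mean" edit cue, and the rule-based heuristics themselves are hypothetical stand-ins for whatever models the authors trained.

```python
from typing import List

# All constants below are hypothetical, chosen only for illustration.
FILLERS = {"uh", "um", "er"}   # filled pauses treated as simple disfluencies
PAUSE = "<pause>"              # assumed pause marker emitted by the recognizer
EDIT_CUE = ["i", "mean"]       # assumed cue that signals a self-correction


def remove_simple_disfluencies(words: List[str]) -> List[str]:
    """Stage 1: drop filled pauses and immediate single-word repetitions."""
    out: List[str] = []
    for w in words:
        if w in FILLERS:
            continue
        if out and out[-1] == w:
            continue
        out.append(w)
    return out


def segment(words: List[str], max_len: int = 12) -> List[List[str]]:
    """Stage 2: toy segmenter that cuts at pause markers, or after
    max_len words as a crude stand-in for a latency limit."""
    segments: List[List[str]] = []
    current: List[str] = []
    for w in words:
        if w == PAUSE:
            if current:
                segments.append(current)
                current = []
            continue
        current.append(w)
        if len(current) >= max_len:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def remove_complex_disfluencies(seg: List[str]) -> List[str]:
    """Stage 3: toy repair rule. At the first edit cue, take the word that
    follows the cue, find its last occurrence before the cue, and delete
    from that occurrence through the cue. Handles one cue per segment."""
    n = len(EDIT_CUE)
    for i in range(len(seg) - n):
        if seg[i:i + n] == EDIT_CUE:
            onset = seg[i + n]
            start = i  # if no match is found, only the cue is dropped
            for j in range(i - 1, -1, -1):
                if seg[j] == onset:
                    start = j
                    break
            return seg[:start] + seg[i + n:]
    return seg


def clean_for_translation(words: List[str], max_len: int = 12) -> List[List[str]]:
    """Run the three stages in the order proposed in the abstract."""
    simple = remove_simple_disfluencies(words)
    return [remove_complex_disfluencies(s) for s in segment(simple, max_len)]


if __name__ == "__main__":
    utterance = ("um i i want to uh go to boston i mean to denver "
                 "<pause> is that that possible").split()
    for seg in clean_for_translation(utterance):
        print(" ".join(seg))
    # prints:
    # i want to go to denver
    # is that possible
```

In this toy setting, removing fillers and repetitions first gives the segmenter a cleaner word stream, while the repair rule runs only once segment boundaries are known, so it never reaches across a boundary.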
