Enriching machine-mediated speech-to-speech translation using contextual information

Conventional approaches to speech-to-speech (S2S) translation typically ignore key contextual information, such as prosody, emphasis, and discourse state, in the translation process. Capturing and exploiting such contextual information is especially important in machine-mediated S2S translation, where it can serve as a complementary knowledge source that may help end users understand and disambiguate the translated speech. In this work, we present a general framework for integrating rich contextual information in S2S translation. We present novel methodologies for integrating source-side context in the form of dialog act (DA) tags, and target-side context in the form of prosodic word prominence. We demonstrate the integration of DA tags in two different statistical translation frameworks: phrase-based translation and a bag-of-words lexical choice model. In addition to producing interpretable, DA-annotated target language translations, we obtain significant improvements in automatic evaluation metrics such as lexical selection accuracy and BLEU score. Our experiments also indicate that finer-grained representations of dialog information, such as yes-no questions, wh-questions, and open questions, are the most useful for improving translation quality. For target-side enrichment, we employ factored translation models to integrate the assignment and transfer of prosodic word prominence (pitch accents) during translation. The factored translation models significantly improve the assignment of correct pitch accents to target words in comparison with a post-processing approach. Our framework is suitable for integrating any word-level or utterance-level contextual information that can be reliably detected (recognized) from speech and/or text.
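To make the two kinds of enrichment concrete, the following minimal sketch shows one way training data could be annotated before it is handed to a translation toolkit. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to attach the DA tag as a leading pseudo-token, the accent labels, and the example sentences are all hypothetical. The `word|factor` layout follows the common convention of joining a surface form and its factors with a vertical bar.

```python
from typing import List


def tag_source_with_dialog_act(tokens: List[str], dialog_act: str) -> List[str]:
    """Attach an utterance-level dialog act (DA) tag to a source utterance.

    Assumption: the tag is prepended as a pseudo-token. This is only one
    possible way to expose source-side context to a translation model.
    """
    return [f"<{dialog_act}>"] + tokens


def to_factored(tokens: List[str], accents: List[str], sep: str = "|") -> str:
    """Pair each target word with a pitch-accent label as a word-level factor.

    Assumption: one label per word, joined as `word|label`.
    """
    if len(tokens) != len(accents):
        raise ValueError("each token needs exactly one accent label")
    return " ".join(f"{word}{sep}{accent}" for word, accent in zip(tokens, accents))


# Hypothetical sentence pair, invented for illustration only.
source = tag_source_with_dialog_act(
    ["is", "the", "room", "available"], "yes-no-question"
)
target = to_factored(
    ["la", "habitación", "está", "disponible"],
    ["none", "accent", "none", "accent"],
)

print(" ".join(source))
print(target)
```

Keeping the DA tag at the utterance level and the accent label at the word level mirrors the distinction drawn in the abstract between utterance-level and word-level context; any other reliably detected label could be substituted in either position.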
