Learning speech translation from interpretation

The basic objective of this thesis is to examine the extent to which automatic speech translation can benefit from an often available but ignored resource, namely human interpreter speech. The main contribution of this thesis is a novel approach to speech translation development that makes use of this resource.

The performance of the statistical models employed in modern speech translation systems depends heavily on the availability of vast amounts of training data. State-of-the-art systems are typically trained on: (1) hundreds, sometimes thousands, of hours of manually transcribed speech audio; (2) bilingual, sentence-aligned text corpora of manual translations, often comprising tens of millions of words; and (3) monolingual text corpora, often comprising hundreds of millions of words. The acquisition of such enormous data resources is highly time-consuming and expensive, rendering the development of deployable speech translation systems prohibitively costly for all but a handful of economically or politically viable languages. As a consequence, speech translation development for a new language pair or domain is typically triggered by global events, e.g. disaster relief operations, that create a major need for cross-lingual verbal communication and thereby justify the high development costs. In such situations, where an urgent need for cross-lingual communication exists but no automatic speech translation solutions are (yet) available, communication is achieved with the help of human interpreters.

In this thesis, we introduce methods that exploit audio recordings of interpreter-mediated communication scenarios for speech translation system development. By employing unsupervised and lightly supervised training techniques, the introduced methods allow us to omit most of the manual transcription effort and all of the manual translation effort that has typically characterized speech translation system development. Thus, we are able to significantly reduce the amount of time-consuming and costly human supervision attached to speech translation system development.

Further contributions of this thesis include: (a) a lightly supervised acoustic model training scheme for recordings of European Parliament Plenary Sessions, supporting the development of ASR systems in the various languages of the European Union without the need for costly verbatim transcriptions; and (b) a sentence segmentation and punctuation recovery scheme for speech translation, addressing the mismatch between the output of automatic speech recognition and the text data on which machine translation systems are trained. Both schemes are sketched below.
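To illustrate contribution (a): lightly supervised training replaces verbatim transcripts with approximate ones, here the official (non-verbatim) session texts. The following is a minimal sketch of the typical selection step, assuming hypothetical inputs in which each audio segment has already been decoded with a language model biased toward the official text; the segment IDs, the similarity measure, and the threshold are illustrative and not the thesis's actual configuration.

    # Lightly supervised data selection (sketch): keep segments whose ASR
    # hypothesis agrees closely with the aligned official session text, and
    # use the hypothesis itself as the acoustic model training transcript.
    from difflib import SequenceMatcher

    def agreement(hyp_words, ref_words):
        """Word-level similarity ratio between hypothesis and reference."""
        return SequenceMatcher(a=hyp_words, b=ref_words).ratio()

    def select_training_segments(segments, threshold=0.8):
        """segments: iterable of (audio_id, asr_hypothesis, aligned_text)."""
        selected = []
        for audio_id, hyp, ref in segments:
            if agreement(hyp.split(), ref.split()) >= threshold:
                # The hypothesis serves as the (automatic) transcript.
                selected.append((audio_id, hyp))
        return selected

    if __name__ == "__main__":
        segments = [
            ("epps_0001", "the council adopted the proposal",
             "the council adopted the proposal"),
            ("epps_0002", "uh we will now eh vote",
             "we shall now proceed to the vote"),
        ]
        for audio_id, transcript in select_training_segments(segments):
            print(audio_id, "->", transcript)  # only the well-matched segment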
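To illustrate contribution (b): ASR output arrives as an unsegmented, unpunctuated word stream, while machine translation models are trained on punctuated sentences. The sketch below shows one standard way to bridge this gap, a hidden-event language model in which sentence boundaries appear as an explicit token; the bigram table and its counts are invented for illustration, and real systems additionally exploit pause and prosody features.

    # Sentence boundary recovery (sketch): at each word gap, compare the
    # bigram evidence for an explicit boundary token against the evidence
    # for an uninterrupted continuation, using counts estimated from
    # punctuated MT training text (the values below are made up).
    BOUNDARY = "<s>"

    BIGRAMS = {
        ("proposal", BOUNDARY): 8, ("proposal", "the"): 1,
        (BOUNDARY, "the"): 9, ("adopted", "the"): 7,
    }

    def bigram(a, b, floor=0.5):
        # Unseen pairs receive a small floor count instead of zero.
        return BIGRAMS.get((a, b), floor)

    def segment(words):
        """Greedily insert boundaries where the boundary reading scores higher."""
        out = [words[0]]
        for prev, nxt in zip(words, words[1:]):
            boundary_score = bigram(prev, BOUNDARY) * bigram(BOUNDARY, nxt)
            if boundary_score > bigram(prev, nxt):
                out.append(BOUNDARY)
            out.append(nxt)
        return out

    if __name__ == "__main__":
        asr_stream = "the council adopted the proposal the vote is open".split()
        print(" ".join(segment(asr_stream)))
        # -> the council adopted the proposal <s> the vote is open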
Zusammenfassung

This dissertation addresses the question of whether automatic speech translation can benefit from audio recordings of human interpretation scenarios. At the core of the work, approaches are developed that make it possible to train the components involved in speech translation, automatic speech recognition and machine translation, with the help of such audio recordings. These approaches are developed on the basis of a real application scenario that demands human simultaneous translation (interpretation), manual transcription, and manual translation on a large scale: the plenary sessions of the European Parliament and the multilingual documents associated with these sessions. The approaches developed here allow speech translation to be trained directly on recordings of human interpretation scenarios while requiring only small amounts of time-consuming and costly human supervision. In particular, only a small fraction of the manually transcribed speech recordings previously necessary for speech translation is required, and none of the otherwise necessary manually produced translations. Furthermore, this dissertation introduces a method that supports the training of speech recognition systems in the various languages of the European Union; here, the freely available text and audio resources of the European Parliament are exploited in order to train acoustic models without costly verbatim transcriptions. The present work further investigates how the combination of speech recognition and machine translation can be improved by means of automatic sentence segmentation and automatic punctuation recovery. Appendix A contains a condensed version of the dissertation in German.
