Consistent Transcription and Translation of Speech

The conventional paradigm in speech translation starts with a speech recognition step that generates transcripts, followed by a translation step that takes the automatic transcripts as input. To address various shortcomings of this paradigm, recent work explores end-to-end trainable direct models that translate without transcribing. However, transcripts can be an indispensable output in practical applications, which often display transcripts alongside the translations to users. We make this common requirement explicit and explore the task of jointly transcribing and translating speech. Although high accuracy of both transcript and translation is crucial, even highly accurate systems can suffer from inconsistencies between the two outputs that degrade the user experience. We introduce a methodology to evaluate consistency and compare several modeling approaches, including the traditional cascaded approach and end-to-end models. We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models featuring a coupled inference procedure are able to achieve strong consistency. We further introduce simple techniques for directly optimizing for consistency, and analyze the resulting trade-offs between consistency, transcription accuracy, and translation accuracy.
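To make the contrast between the cascaded and direct paradigms concrete, below is a minimal sketch of a cascaded pipeline. The function names (recognize_speech, translate_text) are hypothetical placeholders and do not correspond to the systems evaluated in this work.

```python
# Minimal sketch of the conventional cascaded speech-translation pipeline:
# an ASR step produces a transcript, which an MT step then translates.
# Both component models here are hypothetical placeholders, not the
# systems compared in the paper.

def recognize_speech(audio: bytes) -> str:
    """Placeholder ASR model: audio -> source-language transcript."""
    raise NotImplementedError

def translate_text(transcript: str) -> str:
    """Placeholder MT model: source-language text -> target-language text."""
    raise NotImplementedError

def cascaded_speech_translation(audio: bytes) -> tuple[str, str]:
    # The transcript is produced first and then fed to the translator, so
    # recognition errors propagate downstream, and nothing explicitly
    # constrains the two outputs to remain consistent with each other.
    transcript = recognize_speech(audio)
    translation = translate_text(transcript)
    return transcript, translation
```

A direct model would instead map audio straight to the translation without producing an intermediate transcript, which is why it cannot by itself serve applications that must display both outputs.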
