Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR

Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and endto-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but is not fed as input to) ST. During training time, we use multitask learning to jointly learn these two tasks with a shared encoder. En-toDe and En-to-Es experiments on the MuSTC dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.

[1]  Tomoki Toda,et al.  Optimizing Segmentation Strategies for Simultaneous Speech Translation , 2014, ACL.

[2]  Kenneth Ward Church,et al.  Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training , 2020, FINDINGS.

[3]  Chengyi Wang,et al.  Low Latency End-to-End Streaming Speech Recognition with a Scout Network , 2020, INTERSPEECH.

[4]  Naveen Arivazhagan,et al.  Re-Translation Strategies for Long Form, Simultaneous, Spoken Language Translation , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Mattia Antonino Di Gangi,et al.  MuST-C: a Multilingual Speech Translation Corpus , 2019, NAACL.

[6]  Juan Pino,et al.  Streaming Simultaneous Speech Translation with Augmented Memory Transformer , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Nadir Durrani,et al.  Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation , 2018, NAACL.

[8]  John R. Hershey,et al.  Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[9]  Junkun Chen,et al.  Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation , 2021, ICML.

[10]  Juan Pino,et al.  SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation , 2020, AACL.

[11]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[12]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[13]  Tie-Yan Liu,et al.  SimulSpeech: End-to-End Simultaneous Speech to Text Translation , 2020, ACL.

[14]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[15]  Taku Kudo,et al.  SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , 2018, EMNLP.

[16]  Renjie Zheng,et al.  Simultaneous Translation with Flexible Policy via Restricted Imitation Learning , 2019, ACL.

[17]  Haifeng Wang,et al.  DuTongChuan: Context-aware Translation Model for Simultaneous Interpreting , 2019, ArXiv.

[18]  Haifeng Wang,et al.  STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework , 2018, ACL.

[19]  Wei Li,et al.  Monotonic Infinite Lookback Attention for Simultaneous Machine Translation , 2019, ACL.

[20]  Liang Huang,et al.  Simultaneous Translation Policies: From Fixed to Adaptive , 2020, ACL.

[21]  Renjie Zheng,et al.  Simpler and Faster Learning of Adaptive Policies for Simultaneous Translation , 2019, EMNLP.

[22]  Jonathan Le Roux,et al.  Streaming Automatic Speech Recognition with the Transformer Model , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Liang Huang,et al.  MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation , 2020, ArXiv.