The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021

This paper describes USTC-NELSLIP's submissions to the IWSLT 2021 Simultaneous Speech Translation task. We propose a novel simultaneous translation model, the Cross-Attention Augmented Transducer (CAAT), which extends the conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks show that CAAT achieves better quality-latency trade-offs than wait-k, one of the previous state-of-the-art approaches. Based on the CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems for this evaluation campaign. Compared to last year's best systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU across all latency regimes, and our T2T system by an average of 4.6 BLEU.
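For context on the baseline the abstract compares against: wait-k (Ma et al., STACL) is a fixed read/write schedule that first reads k source tokens and then alternates writing one target token and reading one source token. A minimal sketch, where `translate_step` is a hypothetical callback standing in for one decoding step of the underlying translation model:

```python
def wait_k_policy(k, source_stream, translate_step, eos="</s>"):
    """Sketch of the wait-k simultaneous decoding schedule:
    READ the first k source tokens, then alternate WRITE/READ
    until the source is exhausted, then emit the remaining
    target tokens (the "tail") until end-of-sentence.

    `translate_step(source, target)` is a hypothetical placeholder
    for one prefix-to-prefix decoding step of the actual model.
    """
    source, target = [], []
    for token in source_stream:
        source.append(token)                 # READ one source token
        if len(source) >= k:
            target.append(translate_step(source, target))  # WRITE one token
    # Source exhausted: finish decoding the tail
    while target[-1:] != [eos]:
        target.append(translate_step(source, target))
    return target
```

With k = 1 this degenerates to fully interleaved decoding, and k → ∞ recovers offline (full-sentence) translation; CAAT instead learns its read/write behavior jointly with translation rather than fixing it in advance.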
