Speechformer: Reducing Information Loss in Direct Speech Translation

Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation. However, the Transformer's quadratic complexity with respect to the input sequence length prevents its adoption as-is with audio signals, which are typically represented by long sequences. Current solutions resort to an initial, sub-optimal compression based on a fixed sampling of raw audio features, so potentially useful linguistic information is not accessible to the higher-level layers of the architecture. To solve this issue, we propose Speechformer, an architecture that, thanks to reduced memory usage in its attention layers, avoids the initial lossy compression and aggregates information only at a higher level, according to more informed linguistic criteria. Experiments on three language pairs (en→de/es/nl) show the efficacy of our solution, with gains of up to 0.8 BLEU on the standard MuST-C corpus and up to 4.0 BLEU in a low-resource scenario.
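
The linguistically informed aggregation mentioned in the abstract follows the idea of CTC-based compression: consecutive encoder frames that receive the same greedy CTC prediction are averaged into a single vector, so the sequence is shortened at predicted label boundaries rather than at a fixed subsampling rate. The sketch below is a minimal PyTorch illustration of this idea, not the authors' released implementation; the function name `ctc_compress` and the tensor shapes are our own assumptions.

```python
# Minimal sketch of CTC-based length compression (illustrative, not the
# paper's released code): frames whose greedy CTC labels form a run are
# averaged into one vector, shortening the sequence along linguistically
# informed boundaries instead of a fixed subsampling factor.
import torch


def ctc_compress(states: torch.Tensor, ctc_logits: torch.Tensor) -> torch.Tensor:
    """Collapse runs of frames sharing the same argmax CTC label.

    states:     (T, d) encoder states for one utterance
    ctc_logits: (T, V) frame-level CTC logits over the vocabulary
    returns:    (T', d) compressed states, with T' <= T
    """
    labels = ctc_logits.argmax(dim=-1)                 # (T,) greedy CTC labels
    # Mark positions where the predicted label differs from the previous frame.
    change = torch.ones_like(labels, dtype=torch.bool)
    change[1:] = labels[1:] != labels[:-1]
    group_ids = torch.cumsum(change.long(), dim=0) - 1  # (T,) run index per frame
    num_groups = int(group_ids[-1].item()) + 1

    # Average the states belonging to each run of identical labels.
    sums = torch.zeros(num_groups, states.size(1))
    sums.index_add_(0, group_ids, states)
    counts = torch.zeros(num_groups).index_add_(
        0, group_ids, torch.ones(states.size(0))
    )
    return sums / counts.unsqueeze(1)


# Example: 8 frames, hidden size 4, vocabulary of 5 CTC labels.
states = torch.randn(8, 4)
logits = torch.randn(8, 5)
compressed = ctc_compress(states, logits)
print(states.shape, "->", compressed.shape)  # e.g. (8, 4) -> (5, 4)
```

Because the compression happens only after the full-resolution frames have passed through the memory-efficient attention layers, the averaging discards redundancy within a predicted unit rather than information the model has not yet seen.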
