AlloST: Low-resource Speech Translation without Source Transcription

End-to-end architectures have made promising progress in speech translation (ST). However, the ST task remains challenging under low-resource conditions, and most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods for improving ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, in which the encoder generates phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore a specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence carrying semantic information. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the Conformer-based baseline, and its performance approaches that of the best existing method that uses source transcription.
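The core of the framework is the decoder-side fusion of the two encoder streams. Below is a minimal PyTorch sketch of one plausible fusion strategy (parallel cross-attention over the acoustic and phonetic streams, combined by a learned sigmoid gate); the paper compares several fusion strategies, and this illustrative variant, including its dimensions and gating mechanism, is an assumption rather than the authors' exact architecture.

```python
# A minimal sketch, assuming PyTorch, of a gated two-stream decoder layer.
# Hypothetical illustration of decoder-controlled fusion, not the paper's
# exact architecture.
import torch
import torch.nn as nn

class GatedTwoStreamDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_acoustic = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.attn_phone = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # controls the stream mix
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, acoustic_mem, phone_mem):
        # Self-attention over target tokens (causal mask omitted for brevity).
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt)[0])
        # Attend separately to the two encoder output streams.
        a = self.attn_acoustic(x, acoustic_mem, acoustic_mem)[0]
        p = self.attn_phone(x, phone_mem, phone_mem)[0]
        # A sigmoid gate decides, per position and dimension, how much each
        # stream contributes to the fused representation.
        g = torch.sigmoid(self.gate(torch.cat([a, p], dim=-1)))
        x = self.norm2(x + g * a + (1 - g) * p)
        return self.norm3(x + self.ffn(x))

# Toy usage: batch of 2, 10 target positions, 50 acoustic / 20 phone frames.
layer = GatedTwoStreamDecoderLayer()
out = layer(torch.randn(2, 10, 256),
            torch.randn(2, 50, 256),
            torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```

The BPE step can likewise be sketched in a few lines: standard BPE treats each phone as a symbol and repeatedly merges the most frequent adjacent pair, so that recurring phone runs collapse into syllable-like units. The toy phone sequences and the number of merges below are hypothetical; the paper's actual inventory and vocabulary size may differ.

```python
# A minimal pure-Python sketch of BPE over phone sequences; corpus and
# merge count are hypothetical.
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn merge rules from a corpus of phone sequences (lists of phones)."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent phone pairs across the corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged unit.
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

# Hypothetical phone sequences from a recognizer.
phones = [["o", "l", "a"], ["a", "s", "t", "a"], ["o", "l", "a"]]
merges, segmented = learn_bpe_merges(phones, num_merges=2)
print(merges)     # [('o', 'l'), ('ol', 'a')]
print(segmented)  # [['ola'], ['a', 's', 't', 'a'], ['ola']]
```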
