MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.

[1]  Kyu J. Han,et al.  Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Michael Auli,et al.  Unified Speech-Text Pre-training for Speech Translation and Recognition , 2022, ACL.

[3]  Jinyu Li,et al.  Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data , 2022, INTERSPEECH.

[4]  Michael Auli,et al.  data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language , 2022, ICML.

[5]  Jingren Zhou,et al.  OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework , 2022, ICML.

[6]  Ankur Bapna,et al.  mSLAM: Massively multilingual joint pre-training for speech and text , 2022, ArXiv.

[7]  Jinyu Li,et al.  WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , 2021, IEEE Journal of Selected Topics in Signal Processing.

[8]  Rui Wang,et al.  SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing , 2021, ACL.

[9]  Lei Xie,et al.  WENETSPEECH: A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Ankur Bapna,et al.  SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training , 2021, ArXiv.

[11]  Dmitriy Genzel,et al.  Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task , 2021, ACL.

[12]  Ruslan Salakhutdinov,et al.  HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[13]  Heng Ji,et al.  Learning Shared Semantic Space for Speech-to-Text Translation , 2021, FINDINGS.

[14]  Xianyan Jia,et al.  M6: A Chinese Multimodal Pretrainer , 2021, ArXiv.

[15]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[16]  Dmitriy Genzel,et al.  A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  J. Pino,et al.  Fairseq S2T: Fast Speech-to-Text Modeling with Fairseq , 2020, AACL.

[18]  Abdel-rahman Mohamed,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[19]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[20]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[21]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[22]  Luke S. Zettlemoyer,et al.  Transformers with convolutional context for ASR , 2019, ArXiv.

[23]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[24]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[25]  Hui Bu,et al.  AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale , 2018, ArXiv.

[26]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[27]  Hao Zheng,et al.  AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline , 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).

[28]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[29]  Sanjeev Khudanpur,et al.  A study on data augmentation of reverberant speech for robust speech recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Quoc V. Le,et al.  Listen, Attend and Spell , 2015, ArXiv.

[31]  Wei Zhou The Homophone Effect in Mandarin Word Recognition , 2015 .

[32]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[33]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[34]  Philip Gage,et al.  A new algorithm for data compression , 1994 .