论文信息 - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.

[1] A. Alwan,et al. Towards Better Domain Adaptation for Self-Supervised Models: A Case Study of Child ASR , 2022, IEEE Journal of Selected Topics in Signal Processing.

[2] Tara N. Sainath,et al. Massively Multilingual ASR: A Lifelong Learning Solution , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Kai-Wei Chang,et al. SpeechPrompt: An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks , 2022, 2203.16773.

[4] Tara N. Sainath,et al. Joint Unsupervised and Supervised Training for Multilingual ASR , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Tara N. Sainath,et al. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition , 2021, IEEE Journal of Selected Topics in Signal Processing.

[6] Yoav Goldberg,et al. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models , 2021, ACL.

[7] Fadi Biadsy,et al. Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech , 2021, EMNLP.

[8] Chung-Cheng Chiu,et al. w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[9] Chao-Han Huck Yang,et al. Voice2Series: Reprogramming Acoustic Models for Time Series Classification , 2021, ICML.

[10] Tara N. Sainath,et al. Scaling End-to-End Models for Large-Scale Multilingual ASR , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[11] Yifan Gong,et al. On Addressing Practical Challenges for RNN-Transducer , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[12] M. Seltzer,et al. Memory-Efficient Speech Recognition on Smart Devices , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Karen Hambardzumyan,et al. WARP: Word-level Adversarial ReProgramming , 2021, ACL.

[14] Ronan Collobert,et al. Unsupervised Cross-lingual Representation Learning for Speech Recognition , 2020, Interspeech.

[15] Yu Tsao,et al. A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming , 2021, ArXiv.

[16] Gabriel Synnaeve,et al. MLS: A Large-Scale Multilingual Dataset for Speech Research , 2020, INTERSPEECH.

[17] Abdel-rahman Mohamed,et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[18] Yu Zhang,et al. Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[19] Tara N. Sainath,et al. Towards Fast and Accurate Streaming End-To-End ASR , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Abdel-rahman Mohamed,et al. Libri-Light: A Benchmark for ASR with Limited or No Supervision , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Ankur Bapna,et al. Simple, Scalable Adaptation for Neural Machine Translation , 2019, EMNLP.

[22] Tara N. Sainath,et al. Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model , 2019, INTERSPEECH.

[23] Tara N. Sainath,et al. Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models , 2019, INTERSPEECH.

[24] Tara N. Sainath,et al. Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[25] Mona Attariyan,et al. Parameter-Efficient Transfer Learning for NLP , 2019, ICML.

[26] Yann Dauphin,et al. Pay Less Attention with Lightweight and Dynamic Convolutions , 2019, ICLR.

[27] Tara N. Sainath,et al. Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28] Adam Lopez,et al. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation , 2018, NAACL.

[29] Jascha Sohl-Dickstein,et al. Adversarial Reprogramming of Neural Networks , 2018, ICLR.

[30] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[31] Tara N. Sainath,et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32] Hervé Bourlard,et al. An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation , 2017, INTERSPEECH.

[33] Andrea Vedaldi,et al. Learning multiple visual domains with residual adapters , 2017, NIPS.

[34] Tat-Seng Chua,et al. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Steve Renals,et al. Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation , 2016, IEEE ACM Trans. Audio Speech Lang. Process..

[36] Tara N. Sainath,et al. Deep Convolutional Neural Networks for Large-scale Speech Tasks , 2015, Neural Networks.

[37] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.