论文信息 - Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Speech self-supervised learning has attracted much attention due to its promising performance in multiple downstream tasks, and has become a new growth engine for speech recognition in low-resource languages. In this paper, we exploit and analyze a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages in the OpenASR21 Challenge. The investigation covers two important variables during pre-training, three fine-tuning methods, as well as applications in End-to-End and hybrid systems. First, pre-trained models with different pre-training audio data and architectures (wav2vec2.0, HuBERT and WavLM) are explored for their speech recognition performance in low-resource languages. Second, we investigate data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning. Furthermore, we explore what effect fine-tuning has on the similarity of representations extracted from different transformer layers. The similarity analyses cover different pre-trained architectures and fine-tuning languages. We apply pre-trained representations to End-to-End and hybrid systems to confirm our representation analyses, which have obtained better performances as well.

Weiqiang Zhang | Jing Zhao

[1] Juan M. Perero-Codosero,et al. A Comparison of Hybrid and End-to-End ASR Systems for the IberSpeech-RTVE 2020 Speech-to-Text Transcription Challenge , 2022, Applied Sciences.

[2] Juan Pino,et al. XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale , 2021, INTERSPEECH.

[3] Jinyu Li,et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , 2021, IEEE Journal of Selected Topics in Signal Processing.

[4] Sanjeev Khudanpur,et al. Injecting Text and Cross-Lingual Supervision in Few-Shot Learning from Self-Supervised Models , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Li Dong,et al. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA , 2021, ACL.

[6] Weiqiang Zhang,et al. Automatic Speech Recognition for Low-Resource Languages: The Thuee Systems for the IARPA Openasr20 Evaluation , 2021, Automatic Speech Recognition & Understanding.

[7] A. Heba,et al. A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding , 2021, ArXiv.

[8] Hung-yi Lee,et al. Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning , 2021, ArXiv.

[9] Zhongqin Wu,et al. Language Recognition Based on Unsupervised Pretrained Models , 2021, Interspeech.

[10] Ashish R. Mittal,et al. Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration , 2021, Interspeech.

[11] M. Hasegawa-Johnson,et al. Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding , 2021, Interspeech.

[12] N KrishnaD,et al. Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer , 2021, ArXiv.

[13] Priyanshi Shah,et al. CLSRIL-23: Cross Lingual Speech Representations for Indic Languages , 2021, ArXiv.

[14] Karen Livescu,et al. Layer-Wise Analysis of a Self-Supervised Speech Representation Model , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[15] Ruslan Salakhutdinov,et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16] Xiangang Li,et al. GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10, 000 Hours of Transcribed Audio , 2021, Interspeech.

[17] Andy T. Liu,et al. SUPERB: Speech processing Universal PERformance Benchmark , 2021, Interspeech.

[18] Marcely Zanon Boito,et al. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech , 2021, Interspeech.

[19] Gabriel Synnaeve,et al. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training , 2021, Interspeech.

[20] Kevin Duh,et al. Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec , 2021, EACL.

[21] Shiyu Zhou,et al. Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition , 2021, IEEE Signal Processing Letters.

[22] Emmanuel Dupoux,et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation , 2021, ACL.

[23] James R. Glass,et al. Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies , 2020, Interspeech.

[24] Shang-Wen Li,et al. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25] Ronan Collobert,et al. Unsupervised Cross-lingual Representation Learning for Speech Recognition , 2020, Interspeech.

[26] Heung-Seon Oh,et al. Wav2KWS: Transfer Learning From Speech Representations for Keyword Spotting , 2021, IEEE Access.

[27] Gabriel Synnaeve,et al. MLS: A Large-Scale Multilingual Dataset for Speech Research , 2020, INTERSPEECH.

[28] Tie-Yan Liu,et al. LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition , 2020, KDD.

[29] Abdel-rahman Mohamed,et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[30] James R. Glass,et al. Vector-Quantized Autoregressive Predictive Coding , 2020, INTERSPEECH.

[31] Yu Zhang,et al. Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[32] Vishwas M. Shetty,et al. Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33] Ivan Medennikov,et al. Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription , 2020, INTERSPEECH.

[34] Armand Joulin,et al. Unsupervised Pretraining Transfers Well Across Languages , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35] Yoshua Bengio,et al. Multi-Task Self-Supervised Learning for Robust Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36] Abdel-rahman Mohamed,et al. Libri-Light: A Benchmark for ASR with Limited or No Supervision , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[37] Francis M. Tyers,et al. Common Voice: A Massively-Multilingual Speech Corpus , 2019, LREC.

[38] Hung-yi Lee,et al. Meta Learning for End-To-End Low-Resource Speech Recognition , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39] Andy T. Liu,et al. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders , 2019, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[40] Alexei Baevski,et al. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.

[41] Geoffrey E. Hinton,et al. Similarity of Neural Network Representations Revisited , 2019, ICML.

[42] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[43] Ronan Collobert,et al. wav2vec: Unsupervised Pre-training for Speech Recognition , 2019, INTERSPEECH.

[44] Yoshua Bengio,et al. Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks , 2019, INTERSPEECH.

[45] Hao Tang,et al. An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[46] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[47] Hulya Yalcin,et al. Improving Low Resource Turkish Speech Recognition with Data Augmentation and TTS , 2019, 2019 16th International Multi-Conference on Systems, Signals & Devices (SSD).

[48] Anusha Prakash,et al. Articulatory and Stacked Bottleneck Features for Low Resource Speech Recognition , 2018, INTERSPEECH.

[49] Yiming Wang,et al. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[50] Taku Kudo,et al. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates , 2018, ACL.

[51] Shuang Xu,et al. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[52] Shinji Watanabe,et al. ESPnet: End-to-End Speech Processing Toolkit , 2018, INTERSPEECH.

[53] Florian Metze,et al. Sequence-Based Multi-Lingual Low Resource Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[54] John R. Hershey,et al. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[55] Ekapol Chuangsuwanich,et al. Multilingual techniques for low resource automatic speech recognition , 2016 .

[56] A. Waibel,et al. Towards Improving Low-Resource Speech Recognition Using Articulatory and Language Features , 2016, IWSLT.

[57] Daniel Povey,et al. MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[58] Brian Kan-Wing Mak,et al. Multitask Learning of Deep Neural Networks for Low-Resource Speech Recognition , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[59] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[60] Sanjeev Khudanpur,et al. Audio augmentation for speech recognition , 2015, INTERSPEECH.

[61] Chng Eng Siong,et al. A comparative study of BNF and DNN multilingual training on cross-lingual low-resource speech recognition , 2015, INTERSPEECH.

[62] Mark J. F. Gales,et al. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED , 2014, SLTU.

[63] Cheung-Chi Leung,et al. Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[64] Mark J. F. Gales,et al. Data augmentation for low resource languages , 2014, INTERSPEECH.

[65] George Saon,et al. Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[66] Florian Metze,et al. Deep maxout networks for low-resource speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[67] Kenneth Ward Church,et al. Deep neural network features and semi-supervised training for low resource speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[68] Pascal Vincent,et al. Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69] Ngoc Thang Vu,et al. Multilingual bottle-neck features and its application for under-resourced languages , 2012, SLTU.

[70] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[71] Ngoc Thang Vu,et al. Rapid Building of an ASR System for Under-Resourced Languages Based on Multilingual Unsupervised Training , 2011, INTERSPEECH.

[72] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[73] Bhadriraju Krishnamurti,et al. The Dravidian Languages , 2003 .

[74] Andreas Stolcke,et al. SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[75] H. T. Edwards,et al. Characteristics of Vietnamese Phonology , 2002 .

[76] Sarah L. Nesbeitt. Ethnologue: Languages of the World , 1999 .

[77] Rachel Walker,et al. Guaraní Voiceless Stops in Oral versus Nasal Contexts: An Acoustical Study , 1999, Journal of the International Phonetic Association.