Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Self-supervised learning for speech has attracted much attention because of its strong performance on multiple downstream tasks, and it has become a key driver of progress in speech recognition for low-resource languages. In this paper, we apply and analyze a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages from the OpenASR21 Challenge. The investigation covers two important pre-training variables (audio data and model architecture), three fine-tuning methods, and applications in End-to-End and hybrid systems. First, we compare pre-trained models that differ in pre-training audio data and architecture (wav2vec 2.0, HuBERT, and WavLM) by their speech recognition performance in low-resource languages. Second, we investigate data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning. Furthermore, we examine how fine-tuning affects the similarity of representations extracted from different transformer layers; these similarity analyses cover different pre-trained architectures and fine-tuning languages. Finally, we apply the pre-trained representations to End-to-End and hybrid systems, which confirms our representation analyses and also yields better recognition performance.
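
As a rough illustration of the layer-wise similarity analysis, the sketch below compares representations from different transformer layers with linear centered kernel alignment (CKA). The choice of CKA is an assumption, since the abstract does not name the similarity measure, and the function names `linear_cka` and `layer_similarity_matrix` are hypothetical. Random matrices stand in for the per-layer features that would in practice be extracted from a pre-trained or fine-tuned model.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (frames, dims) representation matrices.

    Assumption: linear CKA is used here only as one plausible similarity
    measure; the abstract does not state which measure the authors use.
    """
    # Center each feature dimension over the frame axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layer_similarity_matrix(layers):
    """Pairwise linear CKA for a list of per-layer (frames, dims) arrays."""
    n = len(layers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            sim[i, j] = sim[j, i] = linear_cka(layers[i], layers[j])
    return sim


# Hypothetical stand-in data: 3 "layers", 200 frames, 16 dimensions each.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((200, 16)) for _ in range(3)]
sim = layer_similarity_matrix(layers)
```

Under this assumption, the same matrix can be computed once for a pre-trained model and once for its fine-tuned counterpart; comparing the two shows which layers changed most during fine-tuning.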
