Self-Supervised End-to-End ASR for Low Resource L2 Swedish

Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as ASR for Second Language (L2) speakers of Swedish. Nonetheless, recent advances in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representations that can achieve low WER when incorporated into end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Even though our test set is very small, the results indicate that these systems are competitive in performance with a traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data.
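To make the approach concrete, the following is a minimal sketch, not the paper's exact recipe, of fine-tuning a pretrained cross-lingual wav2vec 2.0 encoder [1, 24] with a character-level CTC head on a small transcribed L2 Swedish set, assuming the Hugging Face `transformers` library. The XLSR-53 checkpoint name is real, but "vocab.json" (a Swedish grapheme inventory) and the sample utterance are illustrative assumptions.

```python
# Minimal sketch: CTC fine-tuning of a self-supervised cross-lingual encoder.
# "vocab.json" and the sample utterance below are hypothetical placeholders.
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Character-level tokenizer over the Swedish grapheme set; "|" marks word
# boundaries, as is conventional for CTC fine-tuning of wav2vec 2.0.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Pretrained cross-lingual encoder (XLSR-53 [24]); the CTC output layer is
# freshly initialized and learned from the small transcribed target set.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional waveform encoder fixed

# One supervised step on a (placeholder) 16 kHz utterance and its transcript.
waveform = torch.randn(3 * 16_000)
transcript = "hej jag heter anna"

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
labels = processor(text=transcript, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # CTC loss over the grapheme sequence
loss.backward()
```

At inference time, greedy transcriptions can be obtained with `processor.batch_decode(torch.argmax(logits, dim=-1))`; the comparison against a hybrid Kaldi-style baseline [12] is independent of this particular toolkit choice.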

[1] Alexei Baevski et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020, NeurIPS.

[2] Tara N. Sainath et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, 2018, ICASSP.

[3] Gabriel Synnaeve et al. MLS: A Large-Scale Multilingual Dataset for Speech Research, 2020, INTERSPEECH.

[4] Alexei Baevski et al. Effectiveness of self-supervised pre-training for speech recognition, 2019, arXiv.

[5] Andreas Stolcke et al. The Microsoft 2017 Conversational Speech Recognition System, 2018, ICASSP.

[6] Quoc V. Le et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019, INTERSPEECH.

[7] Brian Kingsbury et al. Multilingual representations for low resource speech recognition and keyword search, 2015, ASRU.

[8] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[9] Martti Vainio et al. Developing a high-stake digital spoken language proficiency assessment: Results from pilot tests, 2016.

[10] Mark J. F. Gales et al. Data augmentation for low resource languages, 2014, INTERSPEECH.

[11] Seongjin Park et al. A comparison between native and non-native speech for automatic speech recognition, 2019, The Journal of the Acoustical Society of America.

[12] Daniel Povey et al. The Kaldi Speech Recognition Toolkit, 2011.

[13] Xiaodong Cui et al. Data Augmentation for Deep Neural Network Acoustic Modeling, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14] Francis M. Tyers et al. Common Voice: A Massively-Multilingual Speech Corpus, 2020, LREC.

[15] Tatsuya Kawahara et al. Cross-Lingual Transfer Learning of Non-Native Acoustic Modeling for Pronunciation Error Detection and Diagnosis, 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16] Stefan Schaden. Generating Non-Native Pronunciation Lexicons by Phonological Rules, 2003.

[17] Hung-yi Lee et al. Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning, 2020, ICASSP.

[18] Yuan Gao et al. Spoken English Intelligibility Remediation with Pocketsphinx Alignment and Feature Extraction Improves Substantially Over the State of the Art, 2018, IMCEC.

[19] George Saon et al. The IBM 2015 English conversational telephone speech recognition system, 2015, INTERSPEECH.

[20] John P. McCrae et al. A Survey of Current Datasets for Code-Switching Research, 2020, ICACCS.

[21] Emmanuel Dupoux et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation, 2021, ACL.

[22] Diego Giuliani et al. Non-Native Children Speech Recognition Through Transfer Learning, 2018, ICASSP.

[23] Ronan Collobert et al. wav2vec: Unsupervised Pre-training for Speech Recognition, 2019, INTERSPEECH.

[24] Ronan Collobert et al. Unsupervised Cross-lingual Representation Learning for Speech Recognition, 2020, INTERSPEECH.

[25] Avni Rajpal et al. Pseudo Likelihood Correction Technique for Low Resource Accented ASR, 2020, ICASSP.

[26] Geoffrey Zweig et al. Training ASR Models By Generation of Contextual Information, 2020, ICASSP.

[27] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[28] Alexei Baevski et al. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, 2020, ICLR.

[29] Mark J. F. Gales et al. Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED, 2014, SLTU.