Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition

In this paper, we describe the JHU-GoVivace submission for subtask 2 (the code-switching task) of the Multilingual and Code-Switching ASR Challenges for Low-Resource Indian Languages. We build a hybrid HMM-DNN system with several improvements over the provided baseline in lexical, language, and acoustic modeling. For lexical modeling, we expand the pronunciation lexicons using unified pronunciations and phonesets derived from the baseline lexicon and the publicly available WikiPron lexicons for Bengali and Hindi. For acoustic modeling, we explore several neural network architectures, along with supervised pretraining and multilingual training. We also describe how we use large, externally crawled web text for language modeling. Since the challenge data contain artefacts such as misalignments, we explore various data cleanup methods, including acoustic data-driven pronunciation learning to discover Indian-accented pronunciations for English words and for transcribed punctuation. As a result of these efforts, our best systems achieve transliterated WERs of 19.5% and 23.2% on the non-duplicated development sets for Hindi-English and Bengali-English, respectively.
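The headline numbers above are transliterated WERs: before scoring, both the reference and the hypothesis are mapped into a single script, so a code-switched word is not penalized merely for being written in Latin rather than Devanagari or Bengali script. The Python sketch below illustrates that scoring idea under stated assumptions; the transliterate() lookup table is a hypothetical stand-in for a real transliteration tool, and this is not the challenge's official scoring script.

    # Minimal sketch of transliterated WER scoring. Reference and hypothesis
    # are transliterated into one common script (here: Latin) before a
    # standard word-level WER is computed, so cross-script spellings of the
    # same word are not counted as errors.

    def edit_distance(ref, hyp):
        """Word-level Levenshtein distance via dynamic programming."""
        m, n = len(ref), len(hyp)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[m][n]

    # Hypothetical word-level transliteration table; a real system would use
    # a full transliteration model or tool instead of a toy lookup.
    TRANSLIT = {"नमस्ते": "namaste", "कैसे": "kaise"}

    def transliterate(words):
        """Map every word into the common (Latin) script."""
        return [TRANSLIT.get(w, w) for w in words]

    def transliterated_wer(ref_text, hyp_text):
        """WER (in percent) computed after transliterating both sides."""
        ref = transliterate(ref_text.split())
        hyp = transliterate(hyp_text.split())
        return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

    if __name__ == "__main__":
        # Same word in two scripts scores as a match: prints 0.0
        print(transliterated_wer("नमस्ते how are you", "namaste how are you"))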
