Lattice-Free MMI Adaptation of Self-Supervised Pretrained Acoustic Models

In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of self-supervised pretrained acoustic models. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that, by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on Tagalog (84h), compared to the baseline trained only with supervised data.
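
For concreteness, the sketch below illustrates how such a supervised adaptation stage can be wired up in PyTorch: the pretrained encoder is reused, a new output layer over HMM pdf-ids is added, and the whole model is fine-tuned against the LF-MMI criterion. The encoder configuration, feature dimensions, and number of pdf-ids here are illustrative assumptions (a small stand-in encoder replaces the actual pretrained checkpoint so the snippet runs), and the LF-MMI objective itself would be supplied by a toolkit such as pkwrap rather than implemented inline; this is not the authors' released recipe.

```python
import torch
import torch.nn as nn

# Sketch of the adaptation stage: a pretrained Transformer encoder is topped
# with a randomly initialized linear layer over HMM pdf-ids, and the whole
# network is then fine-tuned with the LF-MMI objective.

class AdaptedAcousticModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_pdfs: int):
        super().__init__()
        self.encoder = encoder                          # pretrained weights
        self.output = nn.Linear(hidden_dim, num_pdfs)   # new task-specific layer

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> (batch, time, num_pdfs) scores
        return self.output(self.encoder(feats))

# Stand-in encoder so the sketch runs end-to-end; a real setup would load the
# self-supervised pretrained Transformer checkpoint here instead.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True),
    num_layers=2,
)
model = AdaptedAcousticModel(encoder, hidden_dim=80, num_pdfs=3000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(8, 200, 80)   # dummy batch of acoustic feature frames
scores = model(feats)             # per-frame pdf-id scores, shape (8, 200, 3000)

# In the actual adaptation loop, `scores` and each utterance's numerator FST
# are passed to the LF-MMI criterion (log-ratio of numerator supervision score
# to denominator phone-LM graph score, as computed by an LF-MMI toolkit), and
# loss.backward() / optimizer.step() update encoder and output layer jointly.
```

The point of the sketch is the split between reused and new parameters: only the output layer starts from random initialization, while the encoder starts from the self-supervised pretraining and is updated jointly during LFMMI fine-tuning.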
