On Lattice-Free Boosted MMI Training of HMM and CTC-Based Full-Context ASR Models

Hybrid automatic speech recognition (ASR) models are typically sequence-trained with CTC or LF-MMI criteria. However, the two criteria have vastly different legacies and are usually implemented in different frameworks. In this paper, by decoupling the concepts of modeling units and label topologies, and building proper numerator/denominator graphs accordingly, we establish a generalized framework for hybrid acoustic modeling (AM). Within this framework, we show that LF-MMI is a powerful training criterion applicable to both limited-context and full-context models, to wordpiece/mono-char/bi-char/chenone units, and to both HMM and CTC topologies. From this framework, we propose three novel training schemes, chenone (ch)-CTC-bMMI, wordpiece (wp)-CTC-bMMI, and wp-HMM-bMMI, with different advantages in training performance, decoding efficiency, and decoding time-stamp accuracy. These advantages are evaluated comprehensively on Librispeech, and wp-CTC-bMMI and ch-CTC-bMMI are further evaluated on two real-world ASR tasks to demonstrate their effectiveness. We also show that bi-char (bc) HMM-MMI models can serve as better alignment models than traditional non-neural GMM-HMMs.
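
For context, the boosted MMI (bMMI) criterion in the title is the standard sequence-discriminative objective of Povey et al. (2008), computed lattice-free, i.e., with the denominator summed over a full denominator graph rather than decoded lattices. In the notation below (ours, not the paper's), X_u is the acoustic sequence of utterance u, W_u its reference transcript, S_W the state sequences realizing a hypothesis W under the chosen label topology, kappa the acoustic scale, b >= 0 the boosting factor, and A(W, W_u) an accuracy measure of W against the reference:

    \mathcal{F}_{\mathrm{bMMI}}(\lambda) = \sum_{u} \log
    \frac{p_\lambda(X_u \mid S_{W_u})^{\kappa}\, P(W_u)}
         {\sum_{W} p_\lambda(X_u \mid S_W)^{\kappa}\, P(W)\, e^{-b\, A(W,\, W_u)}}

Setting b = 0 recovers plain LF-MMI; a positive b inflates the denominator contribution of low-accuracy hypotheses, enforcing a larger margin against them.

The decoupling of modeling units from label topologies can also be made concrete with a small sketch: the CTC topology expressed as a generic finite-state machine over an arbitrary unit inventory (wordpieces, mono-chars, bi-chars, or chenones alike), so that the same construction feeds numerator/denominator graph building regardless of the unit choice. This is a minimal, self-contained illustration under our own conventions, not the paper's implementation; the arc format and the ctc_topo helper are hypothetical.

    # Minimal sketch of a CTC label topology as an FST over units 1..num_units.
    # Label 0 doubles as the blank symbol (input side) and epsilon (output side).
    # Arcs are (src_state, dst_state, input_label, output_label); every state
    # is accepting in this sketch.
    def ctc_topo(num_units: int):
        arcs = [(0, 0, 0, 0)]  # blank self-loop at the start state
        for u in range(1, num_units + 1):
            arcs.append((0, u, u, u))      # enter unit u, emit it once
            arcs.append((u, u, u, 0))      # repeated frames of u emit nothing
            arcs.append((u, 0, 0, 0))      # a blank frame closes the unit
            for v in range(1, num_units + 1):
                if v != u:                 # direct transition between distinct units
                    arcs.append((u, v, v, v))
        return arcs

    # Example: a 2-unit inventory yields the familiar compact CTC machine.
    for arc in ctc_topo(2):
        print(arc)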
