Phone Synchronous Decoding with CTC Lattice

Connectionist Temporal Classification (CTC) has recently shown improved efficiency in LVCSR decoding. One popular implementation is to use a CTC model to predict the phone posteriors at each frame which are then used for Viterbi beam search on a modified WFST network. This is still within the traditional frame synchronous decoding framework. In this paper, the peaky posterior property of a CTC model is carefully investigated and it is found that ignoring blank frames will not introduce additional search errors. Based on this phenomenon, a novel phone synchronous decoding framework is proposed. Here, a phone-level CTC lattice is constructed purely using the CTC acoustic model. The resultant CTC lattice is highly compact and removes tremendous search redundancy due to blank frames. Then, the CTC lattice can be composed with the standard WFST to yield the final decoding result. The proposed approach effectively separates the acoustic evidence calculation and the search operation. This not only significantly improves online search efficiency, but also allows flexible acoustic/linguistic resources to be used. Experiments on LVCSR tasks show that phone synchronous decoding can yield an extra 2-3 times speed up compared to the traditional frame synchronous CTC decoding implementation.

[1]  Florian Metze,et al.  An empirical exploration of CTC acoustic models , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[3]  Joris Pelemans,et al.  Sparse non-negative matrix language modeling for skip-grams , 2015, INTERSPEECH.

[4]  Tara N. Sainath,et al.  Acoustic modelling with CD-CTC-SMBR LSTM RNNS , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[5]  Yongqiang Wang,et al.  Efficient lattice rescoring using recurrent neural network language models , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Steve J. Young,et al.  Large vocabulary continuous speech recognition using HTK , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Yongqiang Wang,et al.  Simplifying long short-term memory acoustic models for fast training and decoding , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  S. M. Peeling,et al.  The use of variable frame rate analysis in speech recognition , 1991 .

[10]  Chin-Hui Lee,et al.  A study on cross-language knowledge integration in Mandarin LVCSR , 2012, 2012 8th International Symposium on Chinese Spoken Language Processing.

[11]  Georg Heigold,et al.  Multiframe deep neural networks for acoustic modeling , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[13]  Yajie Miao,et al.  EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[14]  Fernando Pereira,et al.  Weighted finite-state transducers in speech recognition , 2002, Comput. Speech Lang..

[15]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[16]  Alex Acero,et al.  Context dependent phonetic string edit distance for automatic speech recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Ian McGraw,et al.  Personalized speech recognition on mobile devices , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Chin-Hui Lee,et al.  A Bottom-Up Modular Search Approach to Large Vocabulary Continuous Speech Recognition , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Zheng-Hua Tan,et al.  Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection , 2010, IEEE Journal of Selected Topics in Signal Processing.

[21]  Abeer Alwan,et al.  On the use of variable frame rate analysis in speech recognition , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[22]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[23]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[24]  Hermann Ney,et al.  Search Space Pruning Based on Anticipated Path Recombination in LVCSR , 2012, INTERSPEECH.

[25]  Steve J. Young,et al.  State clustering in hidden Markov model-based continuous speech recognition , 1994, Comput. Speech Lang..

[26]  Mehryar Mohri,et al.  The Design Principles of a Weighted Finite-State Transducer Library , 2000, Theor. Comput. Sci..

[27]  Atsushi Nakamura,et al.  Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[28]  Andrew W. Senior,et al.  Fast and accurate recurrent neural network acoustic models for speech recognition , 2015, INTERSPEECH.

[29]  Lalit R. Bahl,et al.  Design of a linguistic statistical decoder for the recognition of continuous speech , 1975, IEEE Trans. Inf. Theory.

[30]  Izhak Shafran,et al.  Discriminatively estimated joint acoustic, duration, and language model for speech recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[31]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[32]  Hugo Van hamme,et al.  Robust phone lattice decoding , 2006, INTERSPEECH.

[33]  Atsushi Nakamura,et al.  Speech Recognition Algorithms Based on Weighted Finite-State Transducers , 2013, Speech Recognition Algorithms Based on Weighted Finite-State Transducers.