论文信息 - Syllable-Based Acoustic Modeling With Lattice-Free MMI for Mandarin Speech Recognition

Syllable-Based Acoustic Modeling With Lattice-Free MMI for Mandarin Speech Recognition

Most automatic speech recognition (ASR) systems in past decades have used context-dependent (CD) phones as the fundamental acoustic units. However, these phone-based approaches lack an easy and efficient way for modeling long-term temporal dependencies. Compared with phone units, syllables span a longer time, typically several phones, thereby having more stable acoustic realizations. In this work, we aim to train a syllable-based acoustic model for Mandarin ASR with lattice-free maximum mutual information (LF-MMI) criterion. We expect that, the combination of longer linguistic units, the RNN-based model structure and the sequence-level objective function, can result in better modeling of long-term temporal acoustic variations. We make multiple modifications to improve the performance of syllable-based AM and benchmark our models on two large-scale databases. Experimental results show that the proposed syllable-based AM performs much better than the CD phone-based baseline, especially on noisy test sets, with faster decoding speed.

Jie Li | Yan Li | Zhiyun Fan | Xiaorui Wang

[1] Dong Yu,et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Xiangang Li,et al. A comparative study on selecting acoustic modeling units in deep neural networks based large vocabulary Chinese speech recognition , 2013, Neurocomputing.

[3] Hao Wu,et al. Context dependent syllable acoustic model for continuous Chinese speech recognition , 2007, INTERSPEECH.

[4] Hui Bu,et al. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale , 2018, ArXiv.

[5] Michael Picheny,et al. Decision trees for phonological rules in continuous speech , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[6] Herman J. M. Steeneken,et al. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[7] Izhak Shafran,et al. Context dependent phone models for LSTM RNN acoustic modelling , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Yiming Wang,et al. Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[9] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[10] K. Munhall,et al. Coarticulation: Theory, Data, and Techniques , 2001 .

[11] Steven Greenberg,et al. Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation , 1999, Speech Commun..

[12] Yiming Wang,et al. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[13] S. J. Young,et al. Tree-based state tying for high accuracy acoustic modelling , 1994 .

[14] Sanjeev Khudanpur,et al. Parallel training of DNNs with Natural Gradient and Parameter Averaging , 2014 .

[15] Shuang Xu,et al. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese , 2018, INTERSPEECH.

[16] Parisa Haghani,et al. Syllable-based acoustic modeling with CTC-SMBR-LSTM , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[17] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[18] Yiming Wang,et al. Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs , 2018, IEEE Signal Processing Letters.

[19] Hao Zheng,et al. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline , 2017, 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).

[20] Shuang Xu,et al. A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese , 2018, ICONIP.

[21] William J. Hardcastle,et al. The origin of coarticulation , 2006 .

[22] Joseph Picone,et al. Syllable-based large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[23] Tara N. Sainath,et al. Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[24] Wenju Liu,et al. Improved Syllable Based Acoustic Modeling by Inter-Syllable Transition Model for Continuous Chinese Speech Recognition , 2009, 2009 Chinese Conference on Pattern Recognition.

[25] Alex Acero,et al. Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[26] Andrew W. Senior,et al. Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.