A Comparison of Lattice-free Discriminative Training Criteria for Purely Sequence-trained Neural Network Acoustic Models

In this work, three lattice-free (LF) discriminative training criteria for purely sequence-trained neural network acoustic models are compared on LVCSR tasks: maximum mutual information (MMI), boosted maximum mutual information (bMMI), and state-level minimum Bayes risk (sMBR). We demonstrate that, analogously to LF-MMI, a neural network acoustic model can also be trained from scratch with the LF-bMMI or LF-sMBR criterion, without the need for cross-entropy pre-training. Furthermore, experimental results on the Switchboard-300hrs and Switchboard+Fisher-2100hrs datasets show that models trained with LF-bMMI consistently outperform those trained with plain LF-MMI, achieving a relative word error rate (WER) reduction of ∼5% over competitive temporal convolution projected LSTM (TDNN-LSTMP) LF-MMI baselines.
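For context, the three criteria named above are standard sequence-discriminative objectives. In their usual forms (notation assumed here for illustration, not taken from the abstract: $\mathcal{O}_u$ are the acoustic observations of utterance $u$, $W_u$ its reference word sequence, $S_W$ the state sequence of hypothesis $W$, $\kappa$ the acoustic scale, $b$ the boosting factor, and $A(\cdot,\cdot)$ a state-level accuracy function), they can be sketched as:

```latex
% MMI: log posterior of the reference against all competing hypotheses
\mathcal{F}_{\mathrm{MMI}} =
  \sum_u \log
  \frac{p(\mathcal{O}_u \mid S_{W_u})^{\kappa}\, P(W_u)}
       {\sum_{W} p(\mathcal{O}_u \mid S_{W})^{\kappa}\, P(W)}

% bMMI: the denominator boosts hypotheses with more state errors
% (low accuracy A) via the factor e^{-b A}; b = 0 recovers MMI
\mathcal{F}_{\mathrm{bMMI}} =
  \sum_u \log
  \frac{p(\mathcal{O}_u \mid S_{W_u})^{\kappa}\, P(W_u)}
       {\sum_{W} p(\mathcal{O}_u \mid S_{W})^{\kappa}\, P(W)\,
        e^{-b\, A(S_W,\, S_{W_u})}}

% sMBR: expected state-level accuracy under the model's posterior
\mathcal{F}_{\mathrm{sMBR}} =
  \sum_u
  \frac{\sum_{W} p(\mathcal{O}_u \mid S_{W})^{\kappa}\, P(W)\,
        A(S_W,\, S_{W_u})}
       {\sum_{W'} p(\mathcal{O}_u \mid S_{W'})^{\kappa}\, P(W')}
```

In the lattice-free setting, the sums over competing hypotheses $W$ are computed exactly over a compact denominator graph (typically built from a phone-level language model) rather than approximated by per-utterance lattices.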
