Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition

In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with a RNN-T trained model, MBR training is conducted via minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypothesis. We also introduce a heuristic to incorporate an external neural network language model (NNLM) in RNN-T beam search decoding and explore MBR training with the external NNLM. Experimental results demonstrate an MBR trained model outperforms a RNN-T trained model substantially and further improvements can be achieved if trained with an external NNLM. Our best MBR trained system achieves absolute character error rate (CER) reductions of 1.2% and 0.5% on read and spontaneous Mandarin speech respectively over a strong convolution and transformer based RNN-T baseline trained on ~21,000 hours of speech.

[1]  Jun Wang,et al.  Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition , 2018, INTERSPEECH.

[2]  Jonathan Le Roux,et al.  Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Tara N. Sainath,et al.  Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Lei Xie,et al.  Exploring RNN-Transducer for Chinese speech recognition , 2018, 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[6]  Rohit Prabhavalkar,et al.  Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[7]  Chengzhu Yu,et al.  A Multistage Training Framework for Acoustic-to-Word Model , 2018, INTERSPEECH.

[8]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[9]  Matt Shannon,et al.  Optimizing Expected Word Error Rate via Sampling for Speech Recognition , 2017, INTERSPEECH.

[10]  Kjell Schubert,et al.  Transformer-Transducer: End-to-End Speech Recognition with Self-Attention , 2019, ArXiv.

[11]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[12]  Tara N. Sainath,et al.  Acoustic modelling with CD-CTC-SMBR LSTM RNNS , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[13]  Hagen Soltau,et al.  Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[14]  Bhuvana Ramabhadran,et al.  Direct Acoustics-to-Word Models for English Conversational Speech Recognition , 2017, INTERSPEECH.

[15]  Tara N. Sainath,et al.  Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[17]  Lei Xie,et al.  Attention-Based End-to-End Speech Recognition in Mandarin , 2017, ArXiv.

[18]  Matt Shannon,et al.  Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping , 2017, INTERSPEECH.

[19]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[20]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[21]  Yann Dauphin,et al.  Language Modeling with Gated Convolutional Networks , 2016, ICML.

[22]  Yifan Gong,et al.  Improving RNN Transducer Modeling for End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[23]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Geoffrey Zweig,et al.  Transformer-Based Acoustic Modeling for Hybrid Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Qiang Huo,et al.  Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Georg Heigold,et al.  Asynchronous stochastic optimization for sequence training of deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Yiming Wang,et al.  Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[28]  Chao Weng,et al.  Dfsmn-San with Persistent Memory Model for Automatic Speech Recognition , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[30]  Tara N. Sainath,et al.  An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[32]  Shuang Xu,et al.  Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[34]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[35]  Brian Kingsbury,et al.  Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).