One in a hundred: Select the best predicted sequence from numerous candidates for streaming speech recognition

RNN-Transducers and improved attention-based sequence-to-sequence models are widely applied to streaming speech recognition. Compared with these two classes of end-to-end models, the CTC model is more efficient in training and inference, but it cannot capture dependencies between output tokens. We assume that the best sequence can be picked out from the top-N candidates predicted by the CTC model: a sufficient number and diversity of candidates can compensate for its lack of language modeling ability. We therefore improve the hybrid CTC/attention model and introduce a two-stage inference method named one-in-a-hundred (OAH). During inference, the CTC decoder first generates many candidates in a streaming fashion; the transformer decoder then selects the best candidate based on the corresponding acoustic encoder states. All experiments are conducted on the Chinese Mandarin dataset AISHELL-1. The results show that the proposed model supports streaming decoding in a fast and straightforward way, achieving up to a 20% reduction in character error rate compared with the baseline CTC model. In addition, the model can also perform non-streaming decoding.
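
To make the two-stage procedure concrete, the following is a minimal PyTorch-style sketch of the inference flow described above. All module handles and method names here (`encoder`, `ctc_decoder.beam_search`, `transformer_decoder`) are hypothetical placeholders, not the paper's actual API; streaming chunking, batching, and score normalization are deliberately simplified.

```python
import torch


def one_in_a_hundred_decode(encoder, ctc_decoder, transformer_decoder,
                            feats: torch.Tensor, n_best: int = 100):
    """Sketch of OAH two-stage inference (assumed interfaces).

    Stage 1: stream N-best hypotheses out of the CTC decoder.
    Stage 2: rescore each hypothesis with the transformer decoder
    and return the highest-scoring one.
    """
    # Stage 1: encode the audio and run CTC beam search to collect
    # the top-N candidate token sequences. In the streaming setting
    # this would be done chunk by chunk as frames arrive.
    enc_states = encoder(feats)  # (T, D) acoustic encoder states
    candidates = ctc_decoder.beam_search(enc_states, beam_size=n_best)

    # Stage 2: score every candidate with the transformer decoder,
    # conditioned on the same encoder states, and keep the best.
    best_seq, best_score = None, float("-inf")
    for seq in candidates:  # seq: (L,) tensor of token ids
        # Teacher-forced log-probabilities of the full candidate.
        log_probs = transformer_decoder(enc_states, seq)  # (L, V)
        score = log_probs.gather(-1, seq.unsqueeze(-1)).sum().item()
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```

Because every candidate is scored in a single teacher-forced pass rather than decoded autoregressively token by token, the second stage adds only a fixed rescoring cost on top of the streaming CTC first pass.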
