Vectorization of hypotheses and speech for faster beam search in encoder-decoder-based speech recognition

Attention-based encoder-decoder networks use a left-to-right beam search algorithm at inference time. The conventional beam search expands hypotheses and then traverses the expanded hypotheses at the next time step. This traversal is typically implemented with a for-loop, which slows down the recognition process. In this paper, we propose a parallelization technique for beam search that accelerates the search by vectorizing multiple hypotheses, eliminating the per-hypothesis for-loop. We also propose a technique that batches multiple speech utterances for offline recognition, removing the for-loop over utterances as well. Unlike in training, this extension is not trivial during beam search because of the pruning and thresholding techniques used for efficient decoding. In addition, our method can combine the scores of external modules, an RNNLM and CTC, in a batch via shallow fusion. Vectorizing hypotheses yields a 3.7x speedup over the original beam search algorithm, and moving the computation to a GPU increases the speedup to 10.5x.
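To make the core idea concrete, below is a minimal sketch of one vectorized beam expansion step, not the authors' ESPnet implementation; the function name beam_step_vectorized and the tensor shapes are illustrative assumptions. Instead of looping over hypotheses and scoring each one separately, all hypotheses in the beam are stacked into a single tensor and expanded with one broadcast addition and one top-k call.

```python
import torch

def beam_step_vectorized(log_probs, hyp_scores, beam_size):
    """One beam expansion step with all hypotheses scored at once.

    log_probs:  (beam, vocab) token log-probabilities from the decoder,
                obtained in a single batched forward pass over the beam.
                In shallow fusion this could already be a weighted sum of
                attention, CTC, and RNNLM scores, also computed in a batch.
    hyp_scores: (beam,) accumulated log scores of the current hypotheses.
    Returns (hyp_idx, tok_idx, top_scores): which hypotheses survive,
    which token extends each of them, and the updated scores.
    """
    vocab = log_probs.size(1)
    # Broadcast each hypothesis score over its token distribution,
    # replacing the per-hypothesis for-loop of the original algorithm.
    total = hyp_scores.unsqueeze(1) + log_probs          # (beam, vocab)
    top_scores, flat_idx = total.view(-1).topk(beam_size)
    hyp_idx = torch.div(flat_idx, vocab, rounding_mode="floor")
    tok_idx = flat_idx % vocab
    return hyp_idx, tok_idx, top_scores
```

Because the whole step is a handful of tensor operations rather than a Python loop, the same code runs unchanged on a GPU, which is where the additional speedup reported above comes from; batching multiple utterances would add a further leading batch dimension to these tensors.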
