A GPU-based parallel WFST decoder on nnet3

Weighted finite-state transducer (WFST) decoding is one of the most performance-intensive stages of automatic speech recognition. To address this, we extend parallel Graphics Processing Unit (GPU) computing to the decoding stage. We describe an extension of the Kaldi speech recognition toolkit that supports WFST decoding for Kaldi nnet3 acoustic models using the CUDA toolkit. We also extend an efficient parallel Viterbi beam-search decoding algorithm to reduce the real-time factor (RTF), i.e. the ratio of decoding time to audio duration. Together with our optimization algorithm, we achieve a 2.3x speed-up on AISHELL corpus decoding. The nnet3 decoder we implement improves real-time performance with no increase in word error rate.
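To make the decoding stage concrete, the sketch below shows one GPU-parallel Viterbi beam-search step over a WFST stored in a CSR-style arc list: each thread relaxes the outgoing arcs of one active state and updates destination costs with an atomic minimum. This is only a minimal illustration of the general technique under assumed data layouts, not the decoder described above; all names (Arc, viterbi_expand_kernel, atomicMinFloat, the beam and loglikes parameters) and the toy two-state graph are invented for the example, and a real decoder must additionally manage active-token lists, epsilon arcs, and lattice generation.

```cuda
// toy_viterbi_step.cu -- hypothetical sketch; data layout and names are
// illustrative and not taken from the Kaldi/nnet3 sources.
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// One WFST arc in a CSR-style arc list: arc_offsets[s] .. arc_offsets[s+1]
// index the outgoing arcs of state s.
struct Arc {
  int ilabel;      // input label (0 = epsilon), indexes the acoustic scores
  int nextstate;   // destination state
  float weight;    // graph cost (-log probability)
};

// atomicMin for float via compare-and-swap (CUDA has no native float atomicMin).
__device__ float atomicMinFloat(float* address, float val) {
  int* address_as_int = reinterpret_cast<int*>(address);
  int old = *address_as_int;
  while (val < __int_as_float(old)) {
    int assumed = old;
    old = atomicCAS(address_as_int, assumed, __float_as_int(val));
    if (old == assumed) break;
  }
  return __int_as_float(old);
}

// One Viterbi beam-search step for a single frame: each thread owns one active
// state, relaxes all of its outgoing arcs, and updates the destination costs
// for the next frame with atomic minimum operations.
__global__ void viterbi_expand_kernel(const int* active_states, int num_active,
                                      const int* arc_offsets, const Arc* arcs,
                                      const float* loglikes,   // acoustic log-likelihoods, this frame
                                      const float* cur_cost,   // best cost per state, this frame
                                      float* next_cost,        // best cost per state, next frame
                                      float best_cost, float beam) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_active) return;
  int s = active_states[i];
  float c = cur_cost[s];
  for (int a = arc_offsets[s]; a < arc_offsets[s + 1]; ++a) {
    const Arc arc = arcs[a];
    float acoustic = (arc.ilabel > 0) ? -loglikes[arc.ilabel] : 0.0f;
    float new_cost = c + arc.weight + acoustic;
    if (new_cost < best_cost + beam)          // beam pruning
      atomicMinFloat(&next_cost[arc.nextstate], new_cost);
  }
}

int main() {
  // Toy graph: 2 states, a single arc 0 -> 1 with ilabel 1.
  int   h_offsets[]  = {0, 1, 1};
  Arc   h_arcs[]     = {{1, 1, 0.5f}};
  int   h_active[]   = {0};
  float h_loglikes[] = {0.0f, -0.2f};          // scores for labels 0 and 1
  float h_cur[]      = {0.0f, FLT_MAX};
  float h_next[]     = {FLT_MAX, FLT_MAX};

  int *d_offsets, *d_active; Arc* d_arcs;
  float *d_loglikes, *d_cur, *d_next;
  cudaMalloc((void**)&d_offsets, sizeof(h_offsets));
  cudaMalloc((void**)&d_arcs, sizeof(h_arcs));
  cudaMalloc((void**)&d_active, sizeof(h_active));
  cudaMalloc((void**)&d_loglikes, sizeof(h_loglikes));
  cudaMalloc((void**)&d_cur, sizeof(h_cur));
  cudaMalloc((void**)&d_next, sizeof(h_next));
  cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);
  cudaMemcpy(d_arcs, h_arcs, sizeof(h_arcs), cudaMemcpyHostToDevice);
  cudaMemcpy(d_active, h_active, sizeof(h_active), cudaMemcpyHostToDevice);
  cudaMemcpy(d_loglikes, h_loglikes, sizeof(h_loglikes), cudaMemcpyHostToDevice);
  cudaMemcpy(d_cur, h_cur, sizeof(h_cur), cudaMemcpyHostToDevice);
  cudaMemcpy(d_next, h_next, sizeof(h_next), cudaMemcpyHostToDevice);

  viterbi_expand_kernel<<<1, 32>>>(d_active, 1, d_offsets, d_arcs, d_loglikes,
                                   d_cur, d_next, /*best_cost=*/0.0f, /*beam=*/15.0f);
  cudaMemcpy(h_next, d_next, sizeof(h_next), cudaMemcpyDeviceToHost);
  printf("cost of state 1 after one frame: %f\n", h_next[1]);  // 0.5 + 0.2 = 0.7
  return 0;
}
```

The atomicMinFloat helper is needed because CUDA exposes no native atomic minimum for float; the compare-and-swap loop shown is a common workaround for resolving concurrent updates to the same destination state.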
