Parallelization Strategies for a Dynamic Lexical Tree Decoder

Physical limitations are increasingly driving a shift from high-clocked single-core processors to CPUs with eight or more independent but slower processing cores, and to multi-core or even multi-CPU computers. To retain performance gains in the future, the speech decoding process has to be reorganized to exploit thread-level parallelism on these CPUs. In this work, we compare two common approaches for dynamic prefix tree decoders, Parallel Score Computation and Parallel Search, as well as a combination of both. Both have already been studied intensively; however, we show here that the latter suffers from hardware cache effects which limit absolute speed-ups and scalability in general. We propose a cache-efficient variation of Parallel Score Computation that is more scalable and faster than any other parallel strategy we compared it with.
