Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units

In this paper we present a fast unsupervised spoken term detection system based on lower-bound Dynamic Time Warping (DTW) search on Graphical Processing Units (GPUs). The lower-bound estimate and the K-nearest-neighbor DTW search are carefully designed to fit the GPU parallel computing architecture. In a spoken term detection task on the TIMIT corpus, a 55x speed-up is achieved over our previous CPU implementation without affecting detection performance. On large, artificially created corpora, measurements show that the total computation time of the entire spoken term detection system grows linearly with corpus size. On average, searching for a keyword on a single desktop computer with modern GPUs requires 2.4 seconds per corpus hour.
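To make the role of the lower bound concrete, the following is a minimal, hypothetical CUDA sketch (not the authors' implementation) of how such a bound might be computed in parallel. It assumes a per-frame DTW distance d(q_i, s_j) = -log(q_i . s_j) between posteriorgram frames; each GPU thread handles one candidate start position and sums, over the query frames, the smallest distance to any frame in that candidate window, which lower-bounds the DTW alignment cost. The kernel name, frame dimensions, window length, and one-thread-per-candidate layout are illustrative assumptions.

// lb_dtw.cu -- hypothetical sketch of a GPU inner-product lower bound for DTW
// on posteriorgram frames; one thread per candidate segment start position.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define DIM  50   // posteriorgram dimensionality (assumed)
#define QLEN 32   // number of query frames (assumed)
#define WIN  64   // candidate window length per start position (assumed)

__global__ void lower_bound_kernel(const float* query,      // [QLEN x DIM]
                                   const float* utterance,  // [numFrames x DIM]
                                   float* lb,               // [numStarts]
                                   int numFrames, int numStarts) {
    int start = blockIdx.x * blockDim.x + threadIdx.x;
    if (start >= numStarts) return;

    float bound = 0.0f;
    for (int i = 0; i < QLEN; ++i) {
        // Largest inner product of query frame i with any frame in the
        // candidate window gives the smallest -log distance for that frame.
        float best = 1e-12f;
        for (int j = 0; j < WIN && start + j < numFrames; ++j) {
            float dot = 0.0f;
            for (int d = 0; d < DIM; ++d)
                dot += query[i * DIM + d] * utterance[(start + j) * DIM + d];
            best = fmaxf(best, dot);
        }
        bound += -logf(best);
    }
    lb[start] = bound;  // lower bound on the DTW cost for this start position
}

int main() {
    const int numFrames = 1000;
    const int numStarts = numFrames - WIN + 1;

    // Toy uniform posteriorgrams, just to exercise the kernel.
    std::vector<float> q(QLEN * DIM, 1.0f / DIM), u(numFrames * DIM, 1.0f / DIM);
    std::vector<float> lb(numStarts);

    float *dq, *du, *dlb;
    cudaMalloc(&dq, q.size() * sizeof(float));
    cudaMalloc(&du, u.size() * sizeof(float));
    cudaMalloc(&dlb, lb.size() * sizeof(float));
    cudaMemcpy(dq, q.data(), q.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(du, u.data(), u.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (numStarts + threads - 1) / threads;
    lower_bound_kernel<<<blocks, threads>>>(dq, du, dlb, numFrames, numStarts);
    cudaMemcpy(lb.data(), dlb, lb.size() * sizeof(float), cudaMemcpyDeviceToHost);

    printf("Lower bound at start 0: %f\n", lb[0]);
    cudaFree(dq); cudaFree(du); cudaFree(dlb);
    return 0;
}

In a K-nearest-neighbor search of this kind, candidates whose lower bound already exceeds the current K-th best DTW score can be skipped, so the cheap, embarrassingly parallel bound computation filters out most of the expensive exact DTW evaluations.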
