Keyword Spotting using Dynamic Time Warping and Convolutional Recurrent Networks

This paper proposes a keyword spotting method that first converts utterances to grayscale images via a modified Dynamic Time Warping (DTW) algorithm and then splits the images into frames, which are fed in sequence to a Convolutional Recurrent Deep Neural Network (CRDNN). DTW is employed because it accurately captures similarities between time sequences, while the neural network exploits the textural features of the DTW matrix for classification. We explore three alternative formulations of the DTW algorithm for extracting the similarity matrices, as well as three methods for converting the similarity matrix to a grayscale image. Unlike previous work, we employ a recurrent network to exploit the sequential information encoded in the image segments. Evaluations on the TIMIT corpus show that the system reaches a detection performance of 95%.
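As a rough illustration of the front end described above, the following sketch computes a DTW accumulated-cost matrix between a query and an utterance, rescales it to an 8-bit grayscale image, and cuts the image into fixed-width frames along the utterance axis. It rests on several assumptions that do not come from the paper: it uses textbook DTW with Euclidean local distances rather than the modified formulations, min-max scaling as the image conversion, and hypothetical function names and a hypothetical frame width. The CRDNN that would consume the frames is omitted.

```python
import numpy as np


def dtw_cost_matrix(query, utterance):
    """Accumulated-cost matrix of textbook DTW (assumed; not the paper's modified variant)."""
    # Pairwise Euclidean distance between every query frame and utterance frame.
    diff = query[:, None, :] - utterance[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n, m = dist.shape
    # Pad with infinity so the recursion needs no special cases at the border.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return acc[1:, 1:]


def to_grayscale(matrix):
    """Min-max scale a matrix to uint8 in [0, 255] (one assumed conversion method)."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros(matrix.shape, dtype=np.uint8)
    return np.round(255 * (matrix - lo) / (hi - lo)).astype(np.uint8)


def split_into_frames(image, width):
    """Cut an image into consecutive frames of `width` columns, dropping any remainder."""
    n_frames = image.shape[1] // width
    trimmed = image[:, : n_frames * width]
    # Result shape: (n_frames, height, width), ordered along the utterance axis.
    return trimmed.reshape(image.shape[0], n_frames, width).transpose(1, 0, 2)


# Random stand-ins for acoustic features (e.g. 13-dimensional vectors per frame).
rng = np.random.default_rng(0)
query = rng.normal(size=(20, 13))
utterance = rng.normal(size=(100, 13))

image = to_grayscale(dtw_cost_matrix(query, utterance))
frames = split_into_frames(image, width=25)
print(frames.shape)  # (4, 20, 25)
```

Each of the four frames is a 20 x 25 image patch; in a setup like the one described, such patches would be passed in order to a convolutional feature extractor followed by a recurrent layer.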
