Speed improvements to Information Retrieval-based dynamic time warping using hierarchical K-Means clustering

With the growth of multimedia data on the Internet, query-by-example spoken term detection (QbE-STD) has become an important mechanism for finding spoken queries in spoken audio. Audio search algorithms must be efficient in both speed and memory to handle large audio files. In general, approaches derived from the well-known dynamic time warping (DTW) algorithm suffer from scalability problems. To overcome these problems, an Information Retrieval-based DTW (IR-DTW) algorithm was proposed recently. IR-DTW borrows techniques from the Information Retrieval community to detect the regions that are most likely to contain the spoken query, and then uses standard DTW to obtain exact start and end times. One drawback of IR-DTW is the time taken to retrieve the reference points that are similar to a given query point. In this paper we propose a method that improves the search performance of the IR-DTW algorithm using a clustering-based technique. The proposed method has shown an estimated speedup of 2400X.
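The abstract does not describe the clustering procedure in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: reference frames are fixed-length feature vectors, they are indexed once with a hierarchical k-means tree, and the similar reference points for a query frame are taken to be the contents of the leaf reached by following the nearest centroid at each level. All names (`build_tree`, `lookup`, `branching`, `leaf_size`) and the random stand-in data are hypothetical and do not come from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the centroid closest to vec (the first one wins on ties)."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))

def kmeans(vectors, k, iters, rng):
    """Plain Lloyd's k-means; returns (centroids, assignment)."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the centroids that the tree will store.
    return centroids, [nearest(v, centroids) for v in vectors]

def build_tree(frames, ids, branching=2, leaf_size=8, iters=10, rng=None):
    """Recursively cluster the reference frame indices in `ids` (assumed layout)."""
    rng = rng or random.Random(0)
    if len(ids) <= max(leaf_size, branching):
        return {"leaf": list(ids)}
    centroids, assign = kmeans([frames[i] for i in ids], branching, iters, rng)
    groups = [[] for _ in range(branching)]
    for i, a in zip(ids, assign):
        groups[a].append(i)
    if sum(1 for g in groups if g) < 2:  # degenerate split, e.g. duplicate frames
        return {"leaf": list(ids)}
    return {"children": [
        (centroids[c], build_tree(frames, g, branching, leaf_size, iters, rng))
        for c, g in enumerate(groups) if g
    ]}

def lookup(tree, query_frame):
    """Descend to one leaf and return its reference frame indices as candidates."""
    node = tree
    while "children" in node:
        _, node = min(node["children"], key=lambda ch: sq_dist(query_frame, ch[0]))
    return node["leaf"]

# Stand-in data: 200 random 4-dimensional "reference frames" (not real features).
data_rng = random.Random(1)
reference = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
tree = build_tree(reference, list(range(len(reference))))
candidates = lookup(tree, reference[17])
```

In this sketch each query frame is compared against a handful of centroids per tree level rather than against every reference frame, which is where a speedup of this kind would come from. The price is that the lookup is approximate: a similar reference frame that falls just across a cluster boundary is missed.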
