A fast hierarchical search algorithm for discriminative keyword spotting

A keyword spotter can be viewed as a binary classifier that separates a set of uttered sentences into two groups according to whether or not they contain the target keywords. For this classification task, the keyword spotter must locate the target keywords using a fast and accurate search algorithm. In our previous work, we used a modified Viterbi (M-Viterbi) search algorithm, which has two known drawbacks. First, to locate the target keywords, it runs an exhaustive search through all possible segments of the input speech. Second, while computing the start and end time frames of each new phone, it forces the keyword spotter to trace back and re-evaluate the timing alignments of all previous phones, even though very little data, if any, is updated as a result. These two drawbacks dramatically enlarge the search space and significantly increase the computational complexity. In this paper, we propose a Hierarchical Search (H-Search) algorithm that allows the system to ignore, at each level of the hierarchy, those segments of the input speech that are less likely to contain the target keywords. In addition, unlike the M-Viterbi algorithm, the H-Search algorithm does not require repeated evaluations when computing the current phone alignment. Compared with the M-Viterbi algorithm, this yields a smaller search space (O(TP) versus O(TPLmax), where T is the number of frames, P is the number of keyword phones, and Lmax is the maximum phone duration) and a lower computational complexity (O(TPLmax) versus O(TPLmax^3)). We applied the H-Search algorithm to the classification part of an Evolutionary Discriminative Keyword Spotting (EDKWS) system introduced in our previous work. The experimental results indicate that the H-Search algorithm runs 100 times faster than the M-Viterbi algorithm, while the performance of the EDKWS system degrades by no more than two percent relative to its performance with the M-Viterbi algorithm.
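
To give a feel for the growth terms quoted above, the following minimal sketch evaluates them for one set of input sizes. The function names and the values of T, P, and Lmax are hypothetical and chosen only for illustration; they are not taken from the experiments. Big-O terms hide constant factors, so the ratios printed here are not a prediction of the measured speed-up.

```python
def search_space_terms(T, P, L_max):
    """Return the (H-Search, M-Viterbi) search-space growth terms.

    T: number of frames, P: number of keyword phones,
    L_max: maximum phone duration in frames (as defined in the abstract).
    """
    return T * P, T * P * L_max


def complexity_terms(T, P, L_max):
    """Return the (H-Search, M-Viterbi) computational-complexity growth terms."""
    return T * P * L_max, T * P * L_max ** 3


if __name__ == "__main__":
    # Hypothetical sizes for illustration only (not from the paper).
    T, P, L_max = 300, 6, 20

    h_space, mv_space = search_space_terms(T, P, L_max)
    h_ops, mv_ops = complexity_terms(T, P, L_max)

    # The search-space terms differ by a factor of L_max,
    # the complexity terms by a factor of L_max squared.
    print("search space:", h_space, "vs", mv_space, "ratio", mv_space // h_space)
    print("complexity:  ", h_ops, "vs", mv_ops, "ratio", mv_ops // h_ops)
```

Under these assumptions the gap between the two algorithms depends only on Lmax: a factor of Lmax in search space and Lmax squared in computational complexity, independent of T and P.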
