Discriminative word-spotting using ordered spectro-temporal patch features

We present a novel architecture for word-spotting which is trained from a small number of examples to classify an utterance as containing a target keyword or not. The word-spotting architecture relies on a novel feature set consisting of a set of ordered spectro-temporal patches which are extracted from exemplar mel-spectra of target keywords. A local pooling operation across frequency and time is introduced which endows the extracted patch features with the flexibility to match novel unseen keywords. Finally, we describe how to train a support vector machine classifier to separate between keyword and nonkeyword patch feature responses. We present preliminary results indicating that our word-spotting architecture achieves a detection rate of 70-95% with false positive rates of about 0.252 false positives per minute.

[1]  T. Poggio,et al.  Hierarchical models of object recognition in cortex , 1999, Nature Neuroscience.

[2]  James R. Glass,et al.  Heterogeneous measurements and multiple classifiers for speech recognition , 1998, ICSLP.

[3]  W. Russell,et al.  Continuous hidden Markov modeling for speaker-independent word spotting , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[4]  Michael Weintraub,et al.  Keyword-spotting using SRI's DECIPHER large-vocabulary speech-recognition system , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  K. Sen,et al.  Spectral-temporal Receptive Fields of Nonlinear Auditory Neurons Obtained Using Natural Sounds , 2022 .

[6]  PoggioTomaso,et al.  Robust Object Recognition with Cortex-Like Mechanisms , 2007 .

[7]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.

[8]  Nello Cristianini,et al.  Support Vector Machines and Kernel Methods: The New Generation of Learning Machines , 2002, AI Mag..

[9]  Thomas Serre,et al.  Robust Object Recognition with Cortex-Like Mechanisms , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Steve J. Young,et al.  A fast lattice-based approach to vocabulary independent wordspotting , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Richard Rose,et al.  A hidden Markov model based keyword recognition system , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[12]  Powen Ru,et al.  Multiresolution spectrotemporal analysis of complex sounds. , 2005, The Journal of the Acoustical Society of America.