Learning the inter-frame distance for discriminative template-based keyword detection

This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element of template matching approaches. The proposed algorithm estimates this distance from data, with the objective of producing a detector that maximizes the Area Under the receiver operating characteristic Curve (AUC), the standard evaluation measure for the keyword detection problem. Experiments performed on a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system: the proposed approach reaches an average AUC of 93.8%, compared to 87.9% for the HMM.
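To make the evaluation measure concrete, the AUC of a detector can be estimated as the fraction of (positive, negative) utterance pairs in which the utterance containing the keyword receives the higher score (the Wilcoxon-Mann-Whitney statistic). The sketch below illustrates only this measure, not the training algorithm of the paper; the function name `auc` and the example scores are hypothetical and chosen for illustration.

```python
def auc(positive_scores, negative_scores):
    """Estimate the AUC as the Wilcoxon-Mann-Whitney statistic.

    positive_scores: detector scores for utterances containing the keyword.
    negative_scores: detector scores for utterances without the keyword.
    Both lists are hypothetical inputs; higher scores mean "keyword present".
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")

    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0   # positive utterance correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores, for illustration only.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2, 0.1]
print(auc(positives, negatives))
```

With these made-up scores, 11 of the 12 pairs are ranked correctly, so the estimate is 11/12. A detector that ranks every positive utterance above every negative one reaches an AUC of 1, while random scoring gives about 0.5.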
