Paper information - Learning the inter-frame distance for discriminative template-based keyword detection

This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element of template-matching approaches. The proposed algorithm estimates the distance from data, with the objective of producing a detector that maximizes the area under the receiver operating characteristic curve (AUC), the standard evaluation measure for keyword detection. Experiments on a large corpus, SpeechDatII, suggest that the model is effective compared to an HMM system: for example, the proposed approach reaches an average AUC of 93.8%, compared to 87.9% for the HMM.
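To make the ingredients named in the abstract concrete, here is a minimal sketch of template matching with a pluggable frame distance, together with the AUC computed as the fraction of correctly ranked positive/negative pairs. It is not the authors' algorithm: the function names (`dtw_cost`, `weighted_sq_dist`, `auc`), the diagonally weighted squared Euclidean distance, and the plain whole-sequence dynamic time warping are all assumptions chosen for illustration, and the weights are fixed by hand rather than learned from data.

```python
import numpy as np

def dtw_cost(template, utterance, frame_dist):
    """Length-normalized DTW alignment cost between two frame sequences.

    Assumption: the template is aligned against the whole utterance segment.
    """
    n, m = len(template), len(utterance)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(template[i - 1], utterance[j - 1])
            # Allow vertical, horizontal, and diagonal alignment steps.
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

def weighted_sq_dist(w):
    """Hypothetical parametric frame distance: sum_k w[k] * (x[k] - y[k])**2."""
    w = np.asarray(w, dtype=float)

    def dist(x, y):
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sum(w * diff * diff))

    return dist

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

In such a setup, a detector could score an utterance by the negated `dtw_cost` against a keyword template, so that higher scores mean a more likely match. The weights `w` are the kind of parameter one might tune to increase `auc` over scores from utterances that do and do not contain the keyword.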

Samy Bengio | David Grangier

[1] Aaron E. Rosenberg, et al. An investigation of the use of dynamic time warping for word spotting and connected speech recognition, 1980, ICASSP.

[2] Yann LeCun, et al. Learning a similarity metric discriminatively, with application to face verification, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[3] Jean-Luc Gauvain, et al. Speaker adaptation based on MAP estimation of HMM parameters, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Hermann Ney, et al. The use of a one-stage dynamic programming algorithm for connected word recognition, 1984.

[5] Herbert Gish, et al. Phonetic training and language modeling for word spotting, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Dirk Van Compernolle, et al. Maximum mutual information training of distance measures for template based speech recognition, 2005.

[7] Harald Höge, et al. Efficient methods for detecting keywords in continuous speech, 1997, EUROSPEECH.

[8] Y. Ermoliev, et al. Stochastic Generalized Gradient Method with Application to Insurance Risk Management, 1997.

[9] Hynek Hermansky. TRAP-TANDEM: data-driven extraction of temporal features from speech, 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding.

[10] Victor Zue, et al. A segment-based wordspotter using phonetic filler models, 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11] Eric Sanders, et al. Speechdat multilingual speech databases for teleservices: across the finish line, 1999, EUROSPEECH.

[12] Patrick Wambacq, et al. A locally weighted distance measure for example based speech recognition, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13] Mehryar Mohri, et al. Confidence Intervals for the Area Under the ROC Curve, 2004, NIPS.

[14] Guillaume Gravier, et al. Overview of the 2000-2001 ELISA Consortium research activities, 2001, Odyssey.

[15] J. Rice. Mathematical Statistics and Data Analysis, 1988.

[16] Richard M. Stern, et al. On the effects of speech rate in large vocabulary speech recognition systems, 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[17] Jithendra Vepa, et al. Using posterior-based features in template matching for speech recognition, 2006, INTERSPEECH.

[18] Samy Bengio, et al. Discriminative keyword spotting, 2009, Speech Commun.