Calibration and multiple system fusion for spoken term detection using linear logistic regression

State-of-the-art calibration and fusion approaches for spoken term detection (STD) systems currently rely on a multi-pass approach where the scores are calibrated, then fused, and finally re-calibrated to obtain a single decision threshold across keywords. While the above techniques are theoretically correct, they rely on meta-parameter tuning and are prone to over-fitting. This study presents an efficient and effective score calibration technique for keyword detection that is based on the logistic regression calibration approach commonly used in forensic speaker identification. The technique applies seamlessly to both single systems and to system fusion, and enables optimization for specific keyword detection evaluation functions. We run experiments on a Vietnamese STD task, comparing the technique with more empirical calibration and fusion schemes and demonstrate that we can achieve comparable or better performance in terms of the NIST ATWV metric with a more elegant solution.

[1]  Arindam Mandal,et al.  Normalized amplitude modulation features for large vocabulary noise-robust speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Murat Saraclar,et al.  Lattice Indexing for Spoken Term Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Hui Jiang,et al.  Confidence measures for speech recognition: A survey , 2005, Speech Commun..

[4]  Andreas Stolcke,et al.  Data-driven lexicon expansion for Mandarin broadcast news and conversation speech recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[5]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[6]  Yun Lei,et al.  Strategies for high accuracy keyword detection in noisy channels , 2013, INTERSPEECH.

[7]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[8]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[9]  Nelson Morgan,et al.  The TAO of ATWV: Probing the mysteries of keyword search performance , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[10]  Mikko Kurimo,et al.  Efficient estimation of maximum entropy language models with n-gram features: an SRILM extension , 2010, INTERSPEECH.

[11]  Wen Wang,et al.  Rich system combination for keyword spotting in noisy and acoustically heterogeneous audio streams , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  OpenKWS 13 Keyword Search Evaluation Plan 1 , 2013 .

[13]  Mei-Yuh Hwang,et al.  Improved tone modeling for Mandarin broadcast news speech recognition , 2006, INTERSPEECH.

[14]  Brian Kingsbury,et al.  Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[16]  Andreas Stolcke,et al.  Improved discriminative training using phone lattices , 2005, INTERSPEECH.

[17]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[18]  Andreas Stolcke,et al.  THE SRI MARCH 2000 HUB-5 CONVERSATIONAL SPEECH TRANSCRIPTION SYSTEM , 2000 .

[19]  Andreas Stolcke,et al.  Voicing feature integration in SRI's decipher LVCSR system , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.