The TAO of ATWV: Probing the mysteries of keyword search performance

In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search systems. Our analysis uses two new oracle ATWV measures and a bootstrap-based ATWV confidence interval, and includes a study of the underpinnings of the large ATWV gains due to system combination. This analysis quantifies the potential ATWV gains from improving the number of true hits and the overall quality of the detection scores in our system's posting lists. It also shows that system combination improves our systems' ATWV via a small increase in the number of true hits in the posting lists.
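For readers unfamiliar with the metric, the following is a minimal sketch of how a term weighted value and a bootstrap confidence interval might be computed. It assumes the standard NIST STD 2006 definition (one minus the keyword-averaged sum of miss probability and β times false alarm probability, with β = 999.9) applied to hard decisions already made by the system. The function names, the per-keyword count tuples, and the choice of a percentile bootstrap that resamples keywords are illustrative assumptions, not necessarily the procedure used in the paper.

```python
import random

BETA = 999.9  # weight from the NIST STD 2006 definition of term weighted value


def atwv(counts, speech_seconds, beta=BETA):
    """Term weighted value from per-keyword decision counts.

    counts: list of (n_true, n_correct, n_false_alarm) tuples, one per keyword
            (hypothetical input layout chosen for this sketch).
    speech_seconds: total duration of the searched audio, in seconds.
    Keywords with no true occurrences are skipped, as their miss
    probability is undefined.
    """
    terms = []
    for n_true, n_correct, n_false_alarm in counts:
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        terms.append(p_miss + beta * p_fa)
    if not terms:
        raise ValueError("no keyword has any true occurrence")
    return 1.0 - sum(terms) / len(terms)


def bootstrap_ci(counts, speech_seconds, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ATWV.

    Assumption: keywords are the resampling unit; the paper may
    resample differently.
    """
    rng = random.Random(seed)
    scored = [c for c in counts if c[0] > 0]
    stats = []
    for _ in range(n_resamples):
        sample = [rng.choice(scored) for _ in scored]
        stats.append(atwv(sample, speech_seconds))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Because β is large, a single false alarm on a keyword costs roughly as much as missing a tenth of that keyword's true occurrences in a ten-thousand-second test set, which is why the quality of detection scores and thresholds matters alongside the raw number of true hits.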
