Predicting search term reliability for spoken term detection systems

Spoken term detection is an extension of text-based searching that allows users to type keywords and search audio files containing recordings of spoken language. Performance depends on many external factors, such as the acoustic channel, the language, pronunciation variations, and the acoustic confusability of the search term. Unlike in text-based search, the likelihood of false alarms and misses for a specific search term, which we refer to as its reliability, plays a significant role in how usable the system is perceived to be. In this paper, we present a system that predicts the reliability of a search term based on its inherent confusability. Our approach integrates predictors of reliability that are based on both acoustic and phonetic features. These predictors are trained on an analysis of recognition errors produced by a state-of-the-art spoken term detection system operating on the Fisher Corpus. This work represents the first large-scale attempt to predict the success of a keyword search term from only its spelling. We explore the complex relationship between the phonetic and acoustic properties of search terms. We show that a 76% correlation between the predicted error rate and the actual measured error rate can be achieved, and that the remaining confusability is due to other acoustic modeling issues that cannot be derived from a search term's spelling.
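
As an illustration of the evaluation measure mentioned above, the following minimal sketch computes a Pearson correlation between predicted and measured per-term error rates. It assumes that the reported correlation is a Pearson coefficient, which the abstract does not state; the function name and the numbers in the usage example are hypothetical and are not taken from the paper.

```python
import math


def pearson_correlation(predicted, measured):
    """Pearson correlation between two equal-length sequences of error rates.

    Assumption: the abstract does not say which correlation measure was used;
    Pearson's r is chosen here purely for illustration.
    """
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n

    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)

    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / math.sqrt(var_p * var_m)


if __name__ == "__main__":
    # Hypothetical per-term error rates, invented for this example only.
    predicted_error = [0.62, 0.35, 0.48, 0.20]
    measured_error = [0.70, 0.30, 0.41, 0.28]
    print(pearson_correlation(predicted_error, measured_error))
```

A value near 1 would mean that terms predicted to be unreliable are also the ones the detection system gets wrong most often; a value near 0 would mean the prediction carries little information about measured reliability.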
