Direct posterior confidence for out-of-vocabulary spoken term detection

Spoken term detection (STD) is a key technology for spoken information retrieval. As compared to the conventional speech transcription and keyword spotting, STD is an open-vocabulary task and has to address out-of-vocabulary (OOV) terms. Approaches based on subword units, for example phones, are widely used to solve the OOV issue; however, performance on OOV terms is still substantially inferior to that of in-vocabulary (INV) terms. The performance degradation on OOV terms can be attributed to a multitude of factors. One particular factor we address in this article is the unreliable confidence estimation caused by weak acoustic and language modeling due to the absence of OOV terms in the training corpora. We propose a direct posterior confidence derived from a discriminative model, such as multilayer perceptron (MLP). The new confidence considers a wide-range acoustic context which is usually important for speech recognition and retrieval; moreover, it localizes on detected speech segments and therefore avoids the impact of long-span word context which is usually unreliable for OOV term detection. In this article, we first develop an extensive discussion about the modeling weakness problem associated with OOV terms, and then propose our approach to address this problem based on direct poster confidence. Our experiments carried out on spontaneous and conversational multiparty meeting speech, demonstrate that the proposed technique provides a significant improvement in STD performance as compared to conventional lattice-based confidence, in particular for OOV terms. Furthermore, the new confidence estimation approach is fused with other advanced techniques for OOV treatment, such as stochastic pronunciation modeling and discriminative confidence normalization. This leads to an integrated solution for OOV term detection that results in a large performance improvement.

[1]  Kenji Iwata,et al.  Robust spoken term detection using combination of phone-based and word-based recognition , 2008, INTERSPEECH.

[2]  Simon King,et al.  Stochastic Pronunciation Modeling for Out-of-Vocabulary Spoken Term Detection , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[4]  Mitchel Weintraub,et al.  LVCSR log-likelihood ratio scoring for keyword spotting , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[5]  Andreas Stolcke,et al.  The SRI/OGI 2006 spoken term detection system , 2007, INTERSPEECH.

[6]  Murat Saraclar,et al.  Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Frédéric Bimbot,et al.  Variable-length sequence matching for phonetic transcription using joint multigrams , 1995, EUROSPEECH.

[8]  Richard Rose,et al.  A hidden Markov model based keyword recognition system , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[9]  Sridha Sridharan,et al.  Optimising Figure of Merit for phonetic spoken term detection , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Jay G. Wilpon,et al.  A two pass classifier for utterance rejection in keyword spotting , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  Steve Renals,et al.  Retrieval of broadcast news documents with the THISL system , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[12]  Kenney Ng Towards robust methods for spoken document retrieval , 1998, ICSLP.

[13]  Martha Larson,et al.  Contextual verification for open vocabulary spoken term detection , 2010, INTERSPEECH.

[14]  Rong Zhang,et al.  Word level confidence annotation using combinations of features , 2001, INTERSPEECH.

[15]  Herbert Gish,et al.  Evaluation of word confidence for speech recognition systems , 1999, Comput. Speech Lang..

[16]  Sridha Sridharan,et al.  A phonetic search approach to the 2006 NIST spoken term detection evaluation , 2007, INTERSPEECH.

[17]  Karen Spärck Jones,et al.  Robust talker-independent audio document retrieval , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[18]  Rafid A. Sukkar,et al.  Subword-based minimum verification error (SB-MVE) training for task independent utterance verification , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[19]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[20]  Peng Yu,et al.  A hybrid word / phoneme-based approach for improved vocabulary-independent search in spontaneous speech , 2004, INTERSPEECH.

[21]  Andreas Stolcke,et al.  Open-vocabulary spoken term detection using graphone-based hybrid recognition systems , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Hiromitsu Nishizaki,et al.  Japanese spoken term detection using syllable transition network derived from multiple speech recognizers' outputs , 2010, INTERSPEECH.

[23]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[24]  Victor Zue,et al.  A segment-based wordspotter using phonetic filler models , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[25]  Michael J. Witbrock,et al.  Using words and phonetic strings for efficient information retrieval from imperfectly transcribed spoken documents , 1997, DL '97.

[26]  Bhuvana Ramabhadran,et al.  Multilingual Spoken Term Detection: Finding and Testing New Pronunciations , 2008 .

[27]  Dong Wang,et al.  Term-Dependent Confidence Normalisation for Out-of-Vocabulary Spoken Term Detection , 2012, Journal of Computer Science and Technology.

[28]  Simon King,et al.  Stochastic pronunciation modelling for spoken term detection , 2009, INTERSPEECH.

[29]  Mikko Kurimo,et al.  Indexing confusion networks for morph-based spoken document retrieval , 2007, SIGIR.

[30]  Lin-Shan Lee,et al.  Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping , 2010, INTERSPEECH.

[31]  Hervé Bourlard,et al.  Improving posterior based confidence measures in hybrid HMM/ANN speech recognition systems , 1998, ICSLP.

[32]  Biing-Hwang Juang,et al.  Robust utterance verification for connected digits recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[33]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[34]  Siddika Parlak,et al.  Spoken term detection for Turkish Broadcast News , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[35]  Javier Tejedor,et al.  Novel methods for query selection and query combination in query-by-example spoken term detection , 2010, SSCS '10.

[36]  Dragutin Petkovic,et al.  Phonetic confusion matrix based spoken document retrieval , 2000, SIGIR '00.

[37]  Chalapathy Neti,et al.  Word-based confidence measures as a guide for stack search in speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[38]  Andreas Stolcke,et al.  Finding consensus in speech recognition: word error minimization and other applications of confusion networks , 2000, Comput. Speech Lang..

[39]  Bhuvana Ramabhadran,et al.  Vocabulary independent spoken term detection , 2007, SIGIR.

[40]  Weiqiang Zhang,et al.  Combining Chinese spoken term detection systems via side-information conditioned linear logistic regression , 2010, INTERSPEECH.

[41]  Chin-Hui Lee,et al.  Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition , 1996, IEEE Trans. Speech Audio Process..

[42]  Steve J. Young,et al.  A fast lattice-based approach to vocabulary independent wordspotting , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[43]  Mark A. Clements,et al.  Phonetic Searching vs. LVCSR: How to Find What You Really Want in Audio Archives , 2002, Int. J. Speech Technol..

[44]  Ashish Verma,et al.  Keyword Search using Modified Minimum Edit Distance Measure , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[45]  Mark Dredze,et al.  A spoken term detection framework for recovering out-of-vocabulary words using the web , 2010, INTERSPEECH.

[46]  Peng Yu,et al.  Vocabulary-independent search in spontaneous speech , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[47]  Bin Ma,et al.  A phonotactic-semantic paradigm for automatic spoken document classification , 2005, SIGIR '05.

[48]  Richard Sproat,et al.  Lattice-Based Search for Spoken Utterance Retrieval , 2004, NAACL.

[49]  Tomoyosi Akiba,et al.  Metric subspace indexing for fast spoken term detection , 2010, INTERSPEECH.

[50]  Stanley F. Chen,et al.  Conditional and joint models for grapheme-to-phoneme conversion , 2003, INTERSPEECH.

[51]  Biing-Hwang Juang,et al.  A training procedure for verifying string hypotheses in continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[52]  Jia Liu,et al.  Fusing multiple systems into a compact lattice index for chinese spoken term detection , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[53]  S. R. Mahadeva Prasanna,et al.  Fast Approximate Spoken Term Detection from Sequence of Phonemes , 2008, SIGIR 2008.

[54]  David A. James,et al.  A system for unrestricted topic retrieval from radio news broadcasts , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[55]  Kari Torkkola An efficient way to learn English grapheme-to-phoneme rules automatically , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[56]  Herbert Gish,et al.  Large vocabulary word scoring as a basis for transcription generation , 1995, EUROSPEECH.

[57]  Biing-Hwang Juang,et al.  Discriminative utterance verification for connected digits recognition , 1995, IEEE Trans. Speech Audio Process..

[58]  Sridha Sridharan,et al.  Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[59]  Dong Wang,et al.  Handling overlaps in spoken term detection , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[60]  R. I. Damper,et al.  Stochastic phonographic transduction for English , 1996, Comput. Speech Lang..

[61]  Simon King,et al.  Stochastic pronunciation modelling and soft match for out-of-vocabulary spoken term detection , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[62]  Gustavo Hernández Ábrego Confidence measures for speech recognition and utterance verification , 2000 .

[63]  Shi-wook Lee,et al.  Two-stage vocabulary-free spoken document retrieval - subword identification and re-recognition of the identified sections , 2006, Interspeech.

[64]  Wayne H. Ward,et al.  A senone based confidence measure for speech recognition , 1997, EUROSPEECH.

[65]  Beth Logan,et al.  Approaches to reduce the effects of OOV queries on indexed spoken audio , 2005, IEEE Transactions on Multimedia.

[66]  George Zavaliagkos,et al.  A hybrid segmental neural net/hidden Markov model system for continuous speech recognition , 1994, IEEE Trans. Speech Audio Process..

[67]  Timothy J. Hazen,et al.  Word and phone level acoustic confidence scoring , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[68]  Karen Spärck Jones,et al.  Retrieving spoken documents by combining multiple index sources , 1996, SIGIR '96.

[69]  Hervé Bourlard,et al.  Continuous speech recognition , 1995, IEEE Signal Process. Mag..

[70]  Beth Logan,et al.  An experimental study of an audio indexing system for the web , 2000, INTERSPEECH.

[71]  Dong Wang,et al.  Out-of-Vocabulary Spoken Term Detection , 2010 .

[72]  Kenneth Ward Church,et al.  Towards spoken term discovery at scale with zero resources , 2010, INTERSPEECH.

[73]  Laurent Miclet,et al.  Rejection of extraneous input in speech recognition applications, using multi-layer perceptrons and the trace of HMMs , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[74]  Lin-Shan Lee,et al.  Integrating recognition and retrieval with user feedback: A new framework for spoken term detection , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[75]  Peter Schäuble,et al.  New techniques for open-vocabulary spoken document retrieval , 1998, SIGIR '98.

[76]  Dong Wang,et al.  Direct posterior confidence for out-of-vocabulary spoken term detection , 2010, TOIS.

[77]  Christopher M. Bishop,et al.  Neural networks for pattern recognition , 1995 .

[78]  Sheryl R. Young,et al.  Detecting misrecognitions and out-of-vocabulary words , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[79]  Ralf Schlüter,et al.  Using word probabilities as confidence measures , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[80]  Michael Cohen,et al.  A phone-dependent confidence measure for utterance rejection , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[81]  D. Watson Death Sentence: The Decay of Public Language , 2003 .

[82]  Paul Taylor,et al.  Hidden Markov models for grapheme to phoneme conversion , 2005, INTERSPEECH.

[83]  Walter Daelemans,et al.  Forgetting Exceptions is Harmful in Language Learning , 1998, Machine Learning.

[84]  Herbert Gish,et al.  Improved estimation, evaluation and applications of confidence measures for speech recognition , 1997, EUROSPEECH.

[85]  Lukás Burget,et al.  Comparison of keyword spotting approaches for informal continuous speech , 2005, INTERSPEECH.

[86]  Simon King,et al.  Growing bottleneck features for tandem ASR , 2008, INTERSPEECH.

[87]  Dong Wang,et al.  A comparison of phone and grapheme-based spoken term detection , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[88]  W. Russell,et al.  Continuous hidden Markov modeling for speaker-independent word spotting , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[89]  Dong Wang,et al.  Posterior-based confidence measures for spoken term detection , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[90]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[91]  Steve Renals,et al.  Confidence measures from local posterior probability estimates , 1999, Comput. Speech Lang..

[92]  R. Damper,et al.  Pronunciation by Analogy: Impact of Implementational Choices on Performance , 1997 .

[93]  Bhuvana Ramabhadran,et al.  Phonetic query expansion for spoken document retrieval , 2008, INTERSPEECH.

[94]  Stephen J. Cox,et al.  Confidence measures for the SWITCHBOARD database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[95]  Sherif Abdou,et al.  Beam search pruning in speech recognition using a posterior probability-based confidence measure , 2004, Speech Commun..

[96]  Peter Regel-Brietzmann,et al.  Word graph rescoring using confidence measures , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[97]  Simon King,et al.  Term-dependent confidence for out-of-vocabulary term detection , 2009, INTERSPEECH.

[98]  Gunnar Evermann,et al.  Large vocabulary decoding and confidence estimation using word posterior probabilities , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[99]  Katsuhito Sudoh,et al.  Discriminative named entity recognition of speech data using speech recognition confidence , 2006, INTERSPEECH.

[100]  Lin Lawrence Chase,et al.  Word and acoustic confidence annotation for large vocabulary speech recognition , 1997, EUROSPEECH.

[101]  Hervé Bourlard,et al.  Iterative Posterior-Based Keyword Spotting Without Filler Models , 1999 .

[102]  Hui Jiang,et al.  Confidence measures for speech recognition: A survey , 2005, Speech Commun..

[103]  Hermann Ney,et al.  Confidence measures for large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[104]  Lukás Burget,et al.  Sub-word modeling of out of vocabulary words in spoken term detection , 2008, 2008 IEEE Spoken Language Technology Workshop.

[105]  Douglas W. Oard,et al.  Combining LVCSR and vocabulary-independent ranked utterance retrieval for robust speech search , 2009, SIGIR.

[106]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[107]  Jia Liu,et al.  A study of lattice-based spoken term detection for Chinese spontaneous speech , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[108]  Karen Spärck Jones,et al.  Effects of out of vocabulary words in spoken document retrieval (poster session) , 2000, SIGIR '00.

[109]  Harald Höge,et al.  A new keyword spotting algorithm with pre-calculated optimal thresholds , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[110]  Lukás Burget,et al.  The AMI Meeting Transcription System: Progress and Performance , 2006, MLMI.

[111]  Rafid A. Sukkar,et al.  Correcting recognition errors via discriminative utterance verification , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[112]  Daniel Povey,et al.  Minimum Phone Error and I-smoothing for improved discriminative training , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[113]  Mitch Weintraub,et al.  Neural-network based measures of confidence for word recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[114]  Kenney Ng,et al.  Subword-based approaches for spoken document retrieval , 2000, Speech Commun..

[115]  Alan W. Black,et al.  Issues in building general letter to sound rules , 1998, SSW.

[116]  Samy Bengio,et al.  Posterior based keyword spotting with a priori thresholds , 2006, INTERSPEECH.

[117]  Hermann Ney,et al.  Multigram-based grapheme-to-phoneme conversion for LVCSR , 2003, INTERSPEECH.

[118]  Bhuvana Ramabhadran,et al.  Effect of pronounciations on OOV queries in spoken term detection , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[119]  Thomas Schaaf,et al.  Confidence measures for spontaneous speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[120]  Beth Logan,et al.  Word and sub-word indexing approaches for reducing the effects of OOV queries on spoken audio , 2002 .

[121]  Thomas Schaaf,et al.  Estimating confidence using word lattices , 1997, EUROSPEECH.

[122]  Fabio Valente,et al.  English spoken term detection in multilingual recordings , 2010, INTERSPEECH.

[123]  Chin-Hui Lee,et al.  Utterance verification of keyword strings using word-based minimum verification error (WB-MVE) training , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[124]  Larry Gillick,et al.  A probabilistic approach to confidence estimation and evaluation , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.