Learning Out-of-Vocabulary Words in Automatic Speech Recognition

Out-of-vocabulary (OOV) words are unknown words that appear in the test speech but not in the recognition vocabulary. They are usually important content words, such as names and locations, that carry information crucial to the success of many speech recognition tasks. However, most speech recognition systems are closed-vocabulary recognizers that only recognize words in a fixed, finite vocabulary. When OOV words occur in the test speech, such systems cannot identify them and instead misrecognize them as in-vocabulary (IV) words. Furthermore, the errors made on OOV words also degrade the recognition accuracy of the surrounding IV words. Therefore, speech recognition systems that can detect and recover OOV words are of great interest. Because simply using a larger vocabulary in a recognizer cannot solve the OOV word problem, several alternative approaches have been proposed. One is to use a hybrid lexicon and hybrid language model that incorporate both word and sub-lexical units during decoding. Another popular OOV word detection method is to locate regions where the word decoding and phone decoding results disagree. Other methods use a classification process to find possible OOV words from confidence scores and other evidence. For OOV word recovery, phoneme-to-grapheme (P2G) conversion is usually applied to predict the written form of an OOV word. Current OOV research focuses on detecting the presence of OOV words in the test speech; there is only limited work on how to convert OOV words into IV words of a recognizer. In this thesis, we therefore investigated learning OOV words in speech recognition. We showed that it is feasible for a recognizer to automatically learn new words and operate on an open vocabulary. Specifically, we built an OOV word learning framework that consists of three major components.
The first component is OOV word detection, where we built hybrid systems using different sub-lexical units to detect OOV words during decoding. We also studied how to improve hybrid system performance using system combination and OOV word classification techniques. Because an OOV word can appear more than once in a conversation or over a period of time, the second component, OOV word clustering, finds multiple instances of the same OOV word. Finally, in OOV word recovery, we explored how to integrate identified OOV words into the recognizer’s lexicon and language model. The proposed work was tested on tasks with different speaking styles and recording conditions, including the Wall Street Journal (WSJ), Broadcast News (BN), and Switchboard (SWB) datasets. Our experimental results show that we are able to detect and recover up to 40% of OOV words using the proposed OOV word learning framework. Such a self-learning speech recognition system should be more robust and have broader applications in real life.
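
As a rough illustration of the detection and clustering steps summarized above, the following sketch assumes a hybrid decoder whose output marks each unit as either a word or a sub-lexical unit (phones here). Runs of consecutive sub-lexical units are treated as OOV candidates, and candidates are then grouped greedily by phone-level edit distance. The token format, the function names, and the greedy distance-threshold clustering are all illustrative assumptions, not the method used in the thesis.

```python
from typing import List, Sequence, Tuple

def find_oov_candidates(tokens: Sequence[Tuple[str, bool]]) -> List[List[str]]:
    """Collect runs of consecutive sub-lexical units as OOV candidates.

    tokens: (unit, is_sublexical) pairs from a hypothetical hybrid decoder.
    """
    candidates: List[List[str]] = []
    run: List[str] = []
    for unit, is_sublexical in tokens:
        if is_sublexical:
            run.append(unit)
        elif run:
            candidates.append(run)  # a word ends the current run
            run = []
    if run:
        candidates.append(run)      # flush a run at the end of the utterance
    return candidates

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two unit sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def cluster_candidates(cands: List[List[str]], max_dist: int = 1) -> List[List[List[str]]]:
    """Greedy clustering: join the first cluster whose seed is within max_dist."""
    clusters: List[List[List[str]]] = []
    for cand in cands:
        for cluster in clusters:
            if edit_distance(cand, cluster[0]) <= max_dist:
                cluster.append(cand)
                break
        else:
            clusters.append([cand])
    return clusters

# Hypothetical hybrid decoding output: words are False, phones are True.
hyp = [("the", False),
       ("K", True), ("AA", True), ("R", True), ("T", True), ("ER", True),
       ("said", False),
       ("K", True), ("AA", True), ("R", True), ("D", True), ("ER", True),
       ("today", False)]

clusters = cluster_candidates(find_oov_candidates(hyp))
```

In this toy input the two phone runs differ by a single substitution, so they fall into one cluster, mimicking two noisy instances of the same OOV word. A real system would likely combine phonetic distance with acoustic and contextual evidence rather than rely on a fixed threshold.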
