Learning sub-word units and exploiting contextual information for open vocabulary speech recognition

Large vocabulary continuous speech recognition (LVCSR) systems fail to recognized words beyond their vocabulary, many of which are information rich terms such as named entities, technical terms, or foreign words. Mis-recognizing these Out-of-Vocabulary (OOV) words can have a disproportionate impact in transcript coherence, and cause recognition failures which propagate through pipeline systems, impacting the performance of downstream applications. Ideally, a speech recognition system would be able to recognize arbitrary, even previously unseen, words. This dissertation presents an approach to recover from failures caused by OOVs by automatically identifying when OOVs are spoken and transcribing them using sub-lexical units. This results in a hybrid word/sub-word system which predicts full-words for in-vocabulary terms and sub-lexical units for OOVs. We first present an approach to model OOVs using sub-lexical units automatically learned from data. The learned units are variable-length phone sequences, which are included in the recognizer's vocabulary and language model. Previous work heuristically creates the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. Instead, we propose a novel unsupervised approach to learn the sub-word lexicon optimized for a given task. This approach employs a log-linear model with overlapping features to learn multi-phone units obtained by segmenting the phonetic representation of a corpus. OOV Detection is the task of identifying regions in the recognizer's output where out-of-vocabulary words were uttered. The detection of OOV regions is helpful to avoid error propagation to downstream applications such as machine translation, named entity recognition, and spoken document retrieval. We combine the proposed hybrid system with confidence based metrics to improve OOV detection performance. Previous work address OOV detection as a binary classification task, where each region is independently classified using local information. This dissertation treats this problem as a sequence labeling problem, and shows that (1) jointly predicting out-of-vocabulary regions, (2) including contextual information from each region, and (3) learning sub-lexical units optimized for this task, leads to substantial improvements with respect to state-of-the-an systems. The resulting sub-word representation and OOV detector is helpful to recover the correct spelling of new words, resulting in an open-vocabulary system; and improves performance in downstream applications strongly affected by out-of-vocabulary terms, such as: spoken term detection and named entity recognition in speech.

[1]  Herbert Gish,et al.  Rapid and accurate spoken term detection , 2007, INTERSPEECH.

[2]  Timothy J. Hazen,et al.  A comparison and combination of methods for OOV word detection and word confidence scoring , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[3]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 shared task , 2003 .

[4]  Jacob Cohen,et al.  The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability , 1973 .

[5]  Karen Spärck Jones,et al.  Effects of out of vocabulary words in spoken document retrieval (poster session) , 2000, SIGIR '00.

[6]  Nir Friedman,et al.  Probabilistic Graphical Models , 2009, Data-Driven Computational Neuroscience.

[7]  Treebank Penn,et al.  Linguistic Data Consortium , 1999 .

[8]  Monika Woszczyna,et al.  Detection and transcription of new words , 1993, EUROSPEECH.

[9]  Mark Dredze,et al.  A spoken term detection framework for recovering out-of-vocabulary words using the web , 2010, INTERSPEECH.

[10]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[11]  Johan Schalkwyk,et al.  OpenFst: A General and Efficient Weighted Finite-State Transducer Library , 2007, CIAA.

[12]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[13]  J. R. Landis,et al.  The measurement of observer agreement for categorical data. , 1977, Biometrics.

[14]  Hermann Ney,et al.  Confidence measures for large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[15]  Bhuvana Ramabhadran,et al.  Effect of pronounciations on OOV queries in spoken term detection , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Lucian Galescu Recognition of out-of-vocabulary words with sub-lexical language models , 2003, INTERSPEECH.

[17]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18]  Geoffrey Zweig,et al.  A segmental CRF approach to large vocabulary continuous speech recognition , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[19]  J. Rissanen Stochastic Complexity in Statistical Inquiry Theory , 1989 .

[20]  Fernando Pereira,et al.  Identifying gene and protein mentions in text using conditional random fields , 2005, BMC Bioinformatics.

[21]  Mark Dredze,et al.  Learning Sub-Word Units for Open Vocabulary Speech Recognition , 2011, ACL.

[22]  Mark Dredze,et al.  Contextual Information Improves OOV Detection in Speech , 2010, NAACL.

[23]  Chris Callison-Burch,et al.  Creating Speech and Language Data With Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.

[24]  Cyril Allauzen,et al.  General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval , 2004, HLT-NAACL 2004.

[25]  Mathias Creutz,et al.  Web Augmentation of Language Models for Continuous Speech Recognition of SMS Text Messages , 2009, EACL.

[26]  Mehryar Mohri,et al.  Finite-State Transducers in Language and Speech Processing , 1997, CL.

[27]  Andreas Stolcke,et al.  Using Conditional Random Fields for Sentence Boundary Detection in Speech , 2005, ACL.

[28]  Mary P. Harper,et al.  A Joint Language Model With Fine-grain Syntactic Tags , 2009, EMNLP.

[29]  Hynek Hermansky,et al.  Combination of strongly and weakly constrained recognizers for reliable detection of OOVS , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[30]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[31]  Bhuvana Ramabhadran,et al.  Balancing false alarms and hits in Spoken Term Detection , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  Geoffrey Zweig,et al.  Advances in speech transcription at IBM under the DARPA EARS program , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Frédéric Béchet,et al.  Robust Named Entity Extraction from Large Spoken Archives , 2005, HLT/EMNLP.

[34]  Stanley Xinlei Wang,et al.  Using graphone models in automatic speech recognition , 2009 .

[35]  Richard M. Schwartz,et al.  Unsupervised acoustic and language model training with small amounts of labelled data , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[36]  Hui Sun,et al.  Using word confidence measure for OOV words detection in a spontaneous spoken dialog system , 2003, INTERSPEECH.

[37]  Richard M. Schwartz,et al.  Automatic Detection Of New Words In A Large Vocabulary Continuous Speech Recognition System , 1989, HLT.

[38]  David Carmel,et al.  Spoken document retrieval from call-center conversations , 2006, SIGIR.

[39]  James K. Baker,et al.  Stochastic modeling for automatic speech understanding , 1990 .

[40]  Frederick Jelinek,et al.  Classifying words for improved statistical language models , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[41]  Kenney Ng,et al.  Subword-based approaches for spoken document retrieval , 2000, Speech Commun..

[42]  Katsuhito Sudoh,et al.  Incorporating Speech Recognition Confidence into Discriminative Named Entity Recognition of Speech Data , 2006, ACL.

[43]  Lalit R. Bahl,et al.  Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition , 1975, IEEE Trans. Inf. Theory.

[44]  Ghinwa F. Choueiter Linguistically-motivated sub-word modeling with applications to speech recognition , 2008 .

[45]  Hermann Ney,et al.  Open vocabulary speech recognition with flat hybrid models , 2005, INTERSPEECH.

[46]  Jordan Cohen,et al.  Vocal tract normalization in speech recognition: Compensating for systematic speaker variability , 1995 .

[47]  Murat Saraclar,et al.  Hybrid language models for out of vocabulary word detection in large vocabulary conversational speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[48]  Mathias Creutz,et al.  Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0 , 2005 .

[49]  Michiel Bacchiani,et al.  Restoring punctuation and capitalization in transcribed speech , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[50]  Georges Linarès,et al.  Using the World Wide Web for Learning New Words in Continuous Speech Recognition Tasks: Two Case Studies , 2009 .

[51]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[52]  I. Lee Hetherington A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding , 1995 .

[53]  Ronald Rosenfeld,et al.  Optimizing lexical and N-gram coverage via judicious use of linguistic data , 1995, EUROSPEECH.

[54]  Brian Kingsbury,et al.  The IBM Attila speech recognition toolkit , 2010, 2010 IEEE Spoken Language Technology Workshop.

[55]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[56]  Ebru Arisoy,et al.  Turkish Broadcast News Transcription and Retrieval , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[57]  Bhuvana Ramabhadran,et al.  A new method for OOV detection using hybrid word/fragment system , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[58]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[59]  Stephanie Seneff,et al.  Two-pass strategy for handling OOVs in a large vocabulary recognition task , 2005, INTERSPEECH.

[60]  Siddika Parlak,et al.  Spoken term detection for Turkish Broadcast News , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[61]  Mehryar Mohri,et al.  Weighted Automata Algorithms , 2009 .

[62]  Stanley F. Chen,et al.  Conditional and joint models for grapheme-to-phoneme conversion , 2003, INTERSPEECH.

[63]  Geoffrey Zweig,et al.  Confidence estimation, OOV detection and language ID using phone-to-word transduction and phone-level alignments , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[64]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[65]  James Glass,et al.  Modelling out-of-vocabulary words for robust speech recognition , 2002 .

[66]  Bernard Mérialdo,et al.  A Dynamic Language Model for Speech Recognition , 1991, HLT.

[67]  Hoifung Poon,et al.  Unsupervised Morphological Segmentation with Log-Linear Models , 2009, NAACL.

[68]  Beth Logan,et al.  An experimental study of an audio indexing system for the web , 2000, INTERSPEECH.

[69]  Hakan Erdogan,et al.  Incremental on-line feature space MLLR adaptation for telephony speech recognition , 2002, INTERSPEECH.

[70]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[71]  Mei-Yuh Hwang,et al.  Web-data augmented language models for Mandarin conversational speech recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[72]  Andreas Stolcke,et al.  Open-vocabulary spoken term detection using graphone-based hybrid recognition systems , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[73]  Hermann Ney,et al.  On structuring probabilistic dependences in stochastic language modelling , 1994, Comput. Speech Lang..

[74]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[75]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[76]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[77]  James R. Glass,et al.  Modeling out-of-vocabulary words for robust speech recognition , 2000, INTERSPEECH.

[78]  Jeremy Morris,et al.  Discriminative Phonetic Recognition with Conditional Random Fields , 2006 .

[79]  Roberto Basili,et al.  Natural language processing in the web era , 2012, Intelligenza Artificiale.

[80]  Walter Daelemans,et al.  Transcription of out-of-vocabulary words in large vocabulary speech recognition based on phoneme-to-grapheme conversion , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[81]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[82]  Ebru Arisoy,et al.  Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages , 2007, HLT-NAACL.

[83]  Hermann Ney,et al.  Joint-sequence models for grapheme-to-phoneme conversion , 2008, Speech Commun..

[84]  Dietrich Klakow,et al.  OOV-detection in large vocabulary system using automatically defined word-fragments as fillers , 1999, EUROSPEECH.

[85]  Herbert Gish,et al.  A parametric approach to vocal tract length normalization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[86]  James R. Glass,et al.  Analysis and Processing of Lecture Audio Data: Preliminary Investigations , 2004, Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004 - SpeechIR '04.

[87]  George Saon,et al.  Maximum likelihood discriminant feature spaces , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[88]  James R. Glass,et al.  Learning units for domain-independent out-of- vocabulary word modelling , 2001, INTERSPEECH.

[89]  L. Baum,et al.  Statistical Inference for Probabilistic Functions of Finite State Markov Chains , 1966 .

[90]  Bhuvana Ramabhadran,et al.  Web derived pronunciations for spoken term detection , 2009, SIGIR.

[91]  Eric Fosler-Lussier,et al.  Combining phonetic attributes using conditional random fields , 2006, INTERSPEECH.

[92]  Simon King,et al.  Named entity extraction from word lattices , 2003, INTERSPEECH.

[93]  Brian Kingsbury,et al.  Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[94]  Thomas Schaaf Detection of OOV words using generalized word models and a semantic class language model , 2001, INTERSPEECH.

[95]  Eric Fosler-Lussier,et al.  Further Experiments with Detector-Based Conditional Random Fields in Phonetic Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[96]  Noah A. Smith,et al.  Contrastive Estimation: Training Log-Linear Models on Unlabeled Data , 2005, ACL.

[97]  Victor Zue,et al.  Phonological parsing for reversible letter-to-sound/sound-to-letter generation , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[98]  H. Ney,et al.  Linear discriminant analysis for improved large vocabulary continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[99]  Frédéric Bimbot,et al.  Variable-length sequence matching for phonetic transcription using joint multigrams , 1995, EUROSPEECH.

[100]  A. Waibel,et al.  Multilingual named entity extraction and translation from text and speech , 2006 .

[101]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[102]  Sadaoki Furui,et al.  Why Is the Recognition of Spontaneous Speech so Hard? , 2005, TSD.

[103]  Yaser Al-Onaizan,et al.  Translation with Finite-State Devices , 1998, AMTA.

[104]  Alvin F. Martin,et al.  The DET curve in assessment of detection task performance , 1997, EUROSPEECH.

[105]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[106]  Andreas Stolcke,et al.  Finding consensus among words: lattice-based word error minimization , 1999, EUROSPEECH.

[107]  Alex Acero,et al.  Maximum Entropy Confidence Estimation for Speech Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[108]  Mathias Creutz,et al.  Unsupervised Discovery of Morphemes , 2002, SIGMORPHON.

[109]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[110]  Sara Stymne,et al.  Vs and OOVs: Two Problems for Translation between German and English , 2010, WMT@ACL.

[111]  Mark Dredze,et al.  Annotating Named Entities in Twitter Data with Crowdsourcing , 2010, Mturk@HLT-NAACL.

[112]  Brian Kingsbury,et al.  Fast decoding for open vocabulary spoken term detection , 2009, HLT-NAACL.

[113]  Mary P. Harper,et al.  Measuring tagging performance of a joint language model , 2009, INTERSPEECH.

[114]  Bhuvana Ramabhadran,et al.  Vocabulary independent spoken term detection , 2007, SIGIR.

[115]  Timothy J. Hazen,et al.  Pronunciation modeling using a finite-state transducer representation , 2005, Speech Commun..

[116]  Bhuvana Ramabhadran,et al.  Towards using hybrid word and fragment units for vocabulary independent LVCSR systems , 2009, INTERSPEECH.

[117]  J. Baker,et al.  The DRAGON system--An overview , 1975 .

[118]  James R. Glass,et al.  Heterogeneous lexical units for automatic speech recognition: preliminary investigations , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[119]  Hui Lin,et al.  OOV detection by joint word/phone lattice alignment , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[120]  Fernando Pereira,et al.  Weighted Automata in Text and Speech Processing , 2005, ArXiv.

[121]  Hagai Aronowitz Online vocabulary adaptation using contextual information and information retrieval , 2008, INTERSPEECH.

[122]  Mary P. Harper,et al.  Self-Training PCFG Grammars with Latent Annotations Across Languages , 2009, EMNLP.

[123]  Lalit R. Bahl,et al.  Design of a linguistic statistical decoder for the recognition of continuous speech , 1975, IEEE Trans. Inf. Theory.