Language Modeling for Limited-Data Domains

With the increasing focus of speech recognition and natural language processing applications on domains with limited amounts of in-domain training data, enhanced system performance often relies on approaches involving model adaptation and combination. In such domains, language models are often constructed by interpolating component models trained from partially matched corpora. Instead of simple linear interpolation, we introduce a generalized linear interpolation technique that computes context-dependent mixture weights from features that correlate with the component confidence and relevance for each n-gram context. Since the n-grams from partially matched corpora may not be of equal relevance to the target domain, we propose an n-gram weighting scheme that adjusts the component n-gram probabilities based on features derived from readily available corpus segmentation and metadata, in order to de-emphasize out-of-domain n-grams. In scenarios without any matched data for a development set, we examine unsupervised and active learning techniques for tuning the interpolation and weighting parameters. On a lecture transcription task, the proposed generalized linear interpolation and n-gram weighting techniques yield up to a 1.4% absolute word error rate reduction over a linearly interpolated baseline language model. As more sophisticated models are only as useful as they are practical, we developed the MIT Language Modeling (MITLM) toolkit, designed for efficient iterative parameter optimization, and released it to the research community. With a compact vector-based n-gram data structure and optimized algorithm implementations, the toolkit not only improves the running time of common tasks by up to 40x, but also enables efficient parameter tuning for language modeling techniques that were previously deemed impractical. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
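To make the contrast between simple and generalized linear interpolation concrete, the following is a minimal Python sketch. It assumes that the context-dependent mixture weights are obtained by a softmax over per-component linear scores of the context features; the abstract only states that the weights are computed from features, so this functional form, along with the function names, the feature values, and the parameter vectors `thetas`, is an illustrative assumption and not the MITLM implementation.

```python
import math

def linear_interpolation(probs, weights):
    """Mix component probabilities p_i(w|h) with fixed weights that sum to 1."""
    return sum(w * p for w, p in zip(weights, probs))

def context_weights(features, thetas):
    """Compute mixture weights for one n-gram context.

    features[i] holds the (hypothetical) feature values of component i for
    this context; thetas[i] holds the matching parameters. A softmax over
    the linear scores is assumed so the weights are positive and sum to 1.
    """
    scores = [sum(t * f for t, f in zip(theta, feats))
              for theta, feats in zip(thetas, features)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generalized_linear_interpolation(probs, features, thetas):
    """Interpolate with weights that depend on the context features."""
    return linear_interpolation(probs, context_weights(features, thetas))

# Two component models scoring the same word in the same context.
probs = [0.02, 0.10]
# Assumed features per component: a bias term and log(1 + context count).
features = [[1.0, math.log(1 + 5)], [1.0, math.log(1 + 200)]]

# All-zero parameters give uniform weights, i.e. plain linear interpolation.
uniform = generalized_linear_interpolation(probs, features, [[0.0, 0.0], [0.0, 0.0]])
# A positive count parameter shifts weight toward the component that has
# observed this context more often.
shifted = generalized_linear_interpolation(probs, features, [[0.0, 0.0], [0.0, 1.0]])
```

With all parameters set to zero the sketch reduces to equal-weight linear interpolation, which illustrates why the generalized form can be viewed as a strict extension of the baseline. In practice the parameters would be tuned on development data, for example by minimizing perplexity.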
