Scaling shrinkage-based language models

In [1], we show that a novel class-based language model, Model M, and regularized minimum discrimination information (rMDI) models outperform comparable methods on moderate amounts of Wall Street Journal data. Both methods are motivated by the observation that shrinking the sum of parameter magnitudes in an exponential language model tends to improve performance [2]. In this paper, we investigate whether these shrinkage-based techniques also perform well on larger training sets and in other domains. First, we explain why good performance on large data sets is uncertain, by showing that gains relative to a baseline n-gram model tend to decrease as training set size increases. Next, we evaluate several methods for data/model combination with Model M and rMDI models on limited-scale domains, to determine which techniques should work best on large domains. Finally, we apply these methods to a variety of medium-to-large-scale domains covering several languages, and show that Model M consistently provides significant gains over existing language models for state-of-the-art systems in both speech recognition and machine translation.
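The shrinkage idea the abstract refers to can be made concrete: an exponential (log-linear) language model scores each word via a weighted sum of features, and training penalizes the magnitudes of the weights so they shrink toward zero. The sketch below is an illustration of such a regularized objective, not the paper's implementation; the feature-array layout and the hyperparameter names `alpha` (l1 strength) and `sigma2` (Gaussian prior variance) are assumptions for the example.

```python
# Illustrative sketch of a shrinkage-regularized log-linear objective:
#   P(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
# Training minimizes negative log-likelihood plus an l1 + l2 penalty,
# which shrinks the sum of parameter magnitudes. Shapes and
# hyperparameters here are assumptions, not the paper's setup.
import numpy as np

def regularized_objective(lam, feats, labels, alpha=0.5, sigma2=6.0):
    """lam: (F,) feature weights; feats: (N, C, F) feature values for
    each of N events and C candidate classes; labels: (N,) gold class
    index per event. Returns the penalized negative log-likelihood."""
    scores = feats @ lam                         # (N, C) unnormalized log-scores
    log_z = np.logaddexp.reduce(scores, axis=1)  # log partition Z(x) per event
    gold = scores[np.arange(len(labels)), labels]
    nll = np.sum(log_z - gold)                   # negative log-likelihood
    # l1 term drives weights to exactly zero; l2 term (Gaussian prior)
    # shrinks the remaining weights toward zero.
    penalty = alpha * np.abs(lam).sum() + lam @ lam / (2.0 * sigma2)
    return nll + penalty
```

With all-zero weights the scores are uniform over the C classes, so the objective reduces to N·log C with no penalty; nonzero weights trade likelihood gains against the shrinkage terms.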

[1] J. Darroch et al. Generalized Iterative Scaling for Log-Linear Models, 1972.

[2] Bernard Mérialdo et al. A Dynamic Language Model for Speech Recognition, 1991, HLT.

[3] Robert L. Mercer et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[4] Robert L. Mercer et al. Adaptive language modeling using minimum discriminant estimation, 1992.

[5] Robert L. Mercer et al. Adaptive Language Modeling Using Minimum Discriminant Estimation, 1992, HLT.

[6] John D. Lafferty et al. Cluster Expansions and Iterative Scaling for Maximum Entropy Language Models, 1995, ArXiv.

[7] Ronald Rosenfeld et al. A maximum entropy approach to adaptive statistical language modelling, 1996, Comput. Speech Lang.

[8] Marcello Federico et al. Bayesian estimation methods for n-gram language model adaptation, 1996, ICSLP.

[9] Tatsuya Kawahara et al. Task adaptation using MAP estimation in N-gram language modeling, 1997, ICASSP.

[10] John D. Lafferty et al. Inducing Features of Random Fields, 1995, IEEE Trans. Pattern Anal. Mach. Intell.

[11] Ronald Rosenfeld et al. Using story topics for language model adaptation, 1997, EUROSPEECH.

[12] Stanley F. Chen et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[13] Andreas Stolcke et al. Entropy-based Pruning of Backoff Language Models, 2000, ArXiv.

[14] John D. Lafferty et al. Boosting and Maximum Likelihood for Exponential Models, 2001, NIPS.

[15] Joshua Goodman et al. Sequential Conditional Generalized Iterative Scaling, 2002, ACL.

[16] Hermann Ney et al. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation, 2002, ACL.

[17] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[18] Jun'ichi Tsujii et al. Evaluation and Extension of Maximum Entropy Models with Inequality Constraints, 2003, EMNLP.

[19] Daniel Marcu et al. Statistical Phrase-Based Translation, 2003, NAACL.

[20] Hermann Ney et al. A Systematic Comparison of Various Statistical Alignment Models, 2003, CL.

[21] Geoffrey Zweig et al. Anatomy of an extremely fast LVCSR decoder, 2005, INTERSPEECH.

[22] Geoffrey Zweig et al. The IBM 2004 conversational telephony system for rich transcription, 2005, ICASSP.

[23] Mark J. F. Gales et al. Progress in the CU-HTK broadcast news transcription system, 2006, IEEE Trans. Audio, Speech, Lang. Process.

[24] Geoffrey Zweig et al. Advances in speech transcription at IBM under the DARPA EARS program, 2006, IEEE Trans. Audio, Speech, Lang. Process.

[25] Ruhi Sarikaya et al. On the use of morphological analysis for dialectal Arabic speech recognition, 2006, INTERSPEECH.

[26] Thorsten Brants et al. Large Language Models in Machine Translation, 2007, EMNLP.

[27] Geoffrey Zweig et al. The IBM 2006 Gale Arabic ASR System, 2007, ICASSP.

[28] Daniel Povey et al. Universal background model based speech recognition, 2008, ICASSP.

[29] Stanley F. Chen et al. Performance Prediction for Exponential Language Models, 2009, NAACL.

[30] Stanley F. Chen et al. Shrinking Exponential Language Models, 2009, NAACL.