Joint training of interpolated exponential n-gram models

For many speech recognition tasks, the best language model performance is achieved by collecting text from multiple sources or domains, and interpolating language models built separately on each individual corpus. It has also been shown that, when multiple corpora are available, a domain adaptation technique such as feature augmentation [13] can improve performance on each individual domain by training a joint model across all of the corpora. In this paper, we explore whether improving each domain model via joint training also improves performance when the models are interpolated together. We show that the diversity of the individual models is an important consideration, and propose a method for adjusting diversity to optimize overall performance. We present results using word n-gram models and Model M, a class-based n-gram model, and demonstrate improvements in both perplexity and word error rate relative to state-of-the-art results on a Broadcast News transcription task.
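The paper itself includes no code, but the interpolation baseline it builds on is the classical linear mixture of per-corpus language models, with the mixture weight estimated by EM on held-out data (the procedure justified in [6]). Below is a minimal Python sketch of that weight estimation for two component models; the `prob(word, context)` model interface, the `DictLM` helper, and all names are illustrative assumptions, not from the paper.

```python
def em_interpolation_weight(model_a, model_b, heldout, iterations=20):
    """Estimate lambda for p(w|h) = lambda*p_a(w|h) + (1-lambda)*p_b(w|h).

    heldout is a sequence of (context, word) pairs; each model exposes a
    hypothetical prob(word, context) method returning a probability.
    """
    lam = 0.5  # start from a uniform mixture
    for _ in range(iterations):
        posterior_sum = 0.0
        n = 0
        for context, word in heldout:
            pa = model_a.prob(word, context)
            pb = model_b.prob(word, context)
            mix = lam * pa + (1.0 - lam) * pb
            if mix > 0.0:
                # E-step: posterior probability that model_a produced this token
                posterior_sum += lam * pa / mix
                n += 1
        # M-step: new weight is the average posterior over held-out tokens
        lam = posterior_sum / max(n, 1)
    return lam


# Toy usage with hypothetical unigram models backed by dicts:
class DictLM:
    def __init__(self, probs):
        self.probs = probs

    def prob(self, word, context):
        # Tiny floor stands in for real smoothing
        return self.probs.get(word, 1e-9)


heldout = [((), w) for w in "the cat sat on the mat".split()]
lm_a = DictLM({"the": 0.20, "cat": 0.10, "sat": 0.10, "on": 0.10, "mat": 0.05})
lm_b = DictLM({"the": 0.15, "cat": 0.05, "sat": 0.05, "on": 0.20, "mat": 0.10})
lam = em_interpolation_weight(lm_a, lm_b, heldout)
```

The same update generalizes to any number of component models by keeping one posterior sum per model and normalizing; each iteration is guaranteed not to decrease the held-out likelihood.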

[1] C. Chelba and A. Acero, "Adaptation of maximum entropy capitalizer: Little data can help a lot," Computer Speech and Language, 2006.

[2] A. Berger and R. Miller, "Just-in-time language modelling," in Proc. ICASSP, 1998.

[3] J. Kazama and J. Tsujii, "Evaluation and extension of maximum entropy models with inequality constraints," in Proc. EMNLP, 2003.

[4] B.-J. Hsu, "Generalized linear interpolation of language models," in Proc. ASRU, 2007.

[5] S. F. Chen et al., "Scaling shrinkage-based language models," in Proc. ASRU, 2009.

[6] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, 1998.

[7] A. Sethy, S. F. Chen, and B. Ramabhadran, "Distributed training of large scale exponential language models," in Proc. ICASSP, 2011.

[8] S. F. Chen, "Shrinking exponential language models," in Proc. NAACL-HLT, 2009.

[9] J. R. Finkel and C. D. Manning, "Hierarchical Bayesian domain adaptation," in Proc. NAACL-HLT, 2009.

[10] J. R. Bellegarda, "Statistical language model adaptation: review and perspectives," Speech Communication, 2004.

[11] J. Wu and S. Khudanpur, "Combining nonlocal, syntactic and n-gram dependencies in language modeling," in Proc. EUROSPEECH, 1999.

[12] C. Allauzen and M. Riley, "Bayesian language model interpolation for mobile speech input," in Proc. INTERSPEECH, 2011.

[13] H. Daumé III, "Frustratingly easy domain adaptation," in Proc. ACL, 2007.

[14] S. F. Chen, K. Seymore, and R. Rosenfeld, "Topic adaptation for language modeling using unnormalized exponential models," in Proc. ICASSP, 1998.

[15] X. Liu, M. J. F. Gales, and P. C. Woodland, "Context dependent language model adaptation," in Proc. INTERSPEECH, 2008.

[16] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Machine Learning, 2003.

[17] D. Klakow, "Log-linear interpolation of language models," in Proc. ICSLP, 1998.