Data selection for language modeling using sparse representations

The ability to adapt language models to specific domains using large generic text corpora is of considerable interest to the language modeling community. A key challenge is identifying the text in the generic collection that is relevant to the target domain. The selection problem can be cast in a semi-supervised learning framework in which the first-pass hypotheses from a speech recognition system are used to identify relevant training material. We present a novel sparse representation formulation that selects a small set of training sentences whose n-gram statistics match the test set distribution. In this formulation, the training sentences form the columns of a matrix whose rows are n-gram counts, and the target vector is the n-gram probability distribution of the test data. A sparse solution identifies the few columns that best represent the target vector, and hence the relevant sentences in the training data. Rescoring with a language model built from the data selected by the proposed method yields a modest gain on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%.
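
To make the formulation concrete, the sketch below casts sentence selection as a nonnegative L1-regularized least-squares problem and solves it with scikit-learn's Lasso. This is a minimal illustration under stated assumptions: the Lasso solver, the unigram-plus-bigram features, the regularization weight, and all variable names are stand-ins chosen for clarity, not a reproduction of the paper's own sparse solver or matrix construction.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

# Toy stand-ins (hypothetical) for the generic training corpus and the
# recognizer's first-pass hypotheses on the test data.
train_sentences = [
    "the market fell sharply in early trading",
    "the senate passed the budget bill today",
    "add two cups of flour to the batter",
]
test_hypotheses = ["the market and the senate dominated the news today"]

# Build the matrix: rows are n-gram counts, columns are training sentences.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(train_sentences)   # (n_sentences, n_ngrams)
A = counts.T.astype(np.float64)                      # (n_ngrams, n_sentences)

# Target vector: the n-gram probability distribution of the test data,
# restricted to n-grams observed in the training sentences.
test_counts = np.asarray(
    vectorizer.transform(test_hypotheses).sum(axis=0)
).ravel()
b = test_counts / test_counts.sum()

# Sparse nonnegative solution: a few columns (sentences) whose n-gram
# profiles best reconstruct the test distribution.
solver = Lasso(alpha=1e-4, positive=True, fit_intercept=False, max_iter=10000)
solver.fit(A, b)

# Nonzero coefficients pick the relevant training sentences.
selected = [train_sentences[j] for j in np.flatnonzero(solver.coef_ > 1e-8)]
print(selected)
```

The nonnegativity constraint keeps the result interpretable: each nonzero coefficient can be read as a relevance weight on the corresponding training sentence, so sparsity in the coefficient vector directly induces a sentence-level selection.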
