Bilingual-LSA Based LM Adaptation for Spoken Language Translation

We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). The bLSA model enforces a one-to-one topic correspondence during training, which allows latent topic distributions to be transferred efficiently across languages. Within this framework, crosslingual LM adaptation proceeds in two steps: first, the topic posterior distribution of the source text is inferred; second, the inferred distribution is applied to the target-language N-gram LM via marginal adaptation. The framework also enables rapid bootstrapping of LSA models for new languages from an existing LSA model in another language. On Chinese-to-English speech and text translation, the bLSA framework reduced the word perplexity of the English LM by over 27% for a unigram LM and by up to 13.6% for a 4-gram LM. The approach also consistently improved machine translation quality with both speech-based and text-based adaptation.
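The two adaptation steps can be illustrated with a minimal sketch. It assumes the commonly used dynamic-marginals form of marginal adaptation, in which each background N-gram probability is rescaled by (p_lsa(w) / p_bg(w))^beta and renormalized over the vocabulary; the abstract does not state the formula, so this form is an assumption. The function names, the value of beta, the three-word vocabulary, and all probabilities are hypothetical toy values, not data from the paper.

```python
# Hypothetical toy sketch of crosslingual marginal adaptation.
# All names and numbers are illustrative assumptions, not the paper's setup.

def lsa_marginal(topic_posterior, topic_word):
    """Target-side unigram: p_lsa(w) = sum_k theta_k * p(w | topic k)."""
    vocab = topic_word[0].keys()
    return {
        w: sum(theta * dist[w] for theta, dist in zip(topic_posterior, topic_word))
        for w in vocab
    }

def marginal_adapt(bg_cond, bg_unigram, lsa_unigram, beta=0.5):
    """Rescale p_bg(w | h) by (p_lsa(w) / p_bg(w)) ** beta, then renormalize."""
    scaled = {
        w: (lsa_unigram[w] / bg_unigram[w]) ** beta * p
        for w, p in bg_cond.items()
    }
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

# Step 1 (assumed already done): topic posterior inferred from the source text.
# With a one-to-one topic correspondence, the same posterior indexes the
# target-language topics directly.
theta = [0.9, 0.1]

# Target-language topic-word distributions (toy values).
topic_word = [
    {"bank": 0.6, "river": 0.1, "loan": 0.3},  # topic 0: finance-like
    {"bank": 0.3, "river": 0.6, "loan": 0.1},  # topic 1: geography-like
]

# Background target LM: unigram marginal and p(w | h) for one fixed history h.
bg_unigram = {"bank": 0.4, "river": 0.4, "loan": 0.2}
bg_cond = {"bank": 0.5, "river": 0.3, "loan": 0.2}

# Step 2: apply the inferred topic distribution to the target N-gram LM.
lsa_unigram = lsa_marginal(theta, topic_word)
adapted = marginal_adapt(bg_cond, bg_unigram, lsa_unigram, beta=0.5)

for word in sorted(adapted):
    print(word, round(adapted[word], 4))
```

In this toy setting the source text is mostly about the finance-like topic, so the adapted model shifts probability mass away from "river" while remaining a proper distribution over the vocabulary for the given history.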
