Cross-lingual latent semantic analysis for language modeling

Statistical language model estimation requires large amounts of domain-specific text, which is difficult to obtain in many languages. We propose techniques that exploit domain-specific text in a resource-rich language to adapt a language model in a resource-deficient language. A primary advantage of our technique is that it does not require any machine translation capability during cross-lingual language model adaptation. Instead, we assume that only a modest-sized collection of story-aligned document pairs in the two languages is available. We use ideas from cross-lingual latent semantic analysis to develop a single low-dimensional representation shared by words and documents in both languages, which enables us to (i) find documents in the resource-rich language pertaining to a specific story in the resource-deficient language, and (ii) extract statistics from the pertinent documents to adapt a language model to the story of interest. We demonstrate significant reductions in perplexity and error rates on a Mandarin speech recognition task using this technique.
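
To make the shared-representation idea concrete, the following is a minimal sketch of the cross-lingual LSA step in Python/NumPy. It stacks the two languages' term-document matrices so each column is one aligned document pair, takes a truncated SVD to obtain the shared latent space, and ranks resource-rich documents by similarity to a new resource-deficient story. The toy data, the latent dimensionality k, and all function names are illustrative assumptions, not from the paper; the subsequent adaptation step (mapping statistics from the retrieved documents back into the resource-deficient language's model) is omitted.

```python
import numpy as np

# Toy story-aligned document pairs: (resource-deficient language,
# resource-rich language). Data and names below are illustrative only.
aligned_pairs = [
    ("shichang jingji zengzhang sudu", "market economy growth rate rises"),
    ("zuqiu bisai jinqiu duiwu",       "football match goal team wins"),
    ("yinhang lilv tiaozheng zhengce", "bank interest rate policy change"),
]
src_docs = [s for s, _ in aligned_pairs]   # resource-deficient side
tgt_docs = [t for _, t in aligned_pairs]   # resource-rich side

def vocab(docs):
    """Map each word in the corpus to a row index."""
    return {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

src_v, tgt_v = vocab(src_docs), vocab(tgt_docs)

def counts(docs, v):
    """Term-document count matrix for one language."""
    M = np.zeros((len(v), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            M[v[w], j] += 1.0
    return M

# Stack the two monolingual term-document matrices so each column
# represents one bilingual document pair; a truncated SVD of the stacked
# matrix yields a single latent space shared by the words and documents
# of both languages (the cross-lingual LSA step).
W = np.vstack([counts(src_docs, src_v), counts(tgt_docs, tgt_v)])
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 2                                  # latent dimensionality (assumed)
Uk, Sk, Vk = U[:, :k], S[:k], Vt[:k, :].T

def fold_in(doc, v, offset):
    """Project a new monolingual document into the shared latent space."""
    q = np.zeros(W.shape[0])
    for w in doc.split():
        if w in v:
            q[offset + v[w]] += 1.0
    return (q @ Uk) / Sk               # standard LSA fold-in: S^{-1} U^T q

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Given a new story in the resource-deficient language, rank the training
# pairs by similarity in the shared space; the resource-rich sides of the
# top-ranked pairs supply the statistics for adapting the language model.
story = "yinhang lilv tiaozheng"
q = fold_in(story, src_v, offset=0)
ranked = sorted(range(len(tgt_docs)), key=lambda j: cos(q, Vk[j]), reverse=True)
print("most pertinent resource-rich document:", tgt_docs[ranked[0]])
```

Because the fold-in projection places new stories in the same space as the training document pairs, cosine similarity suffices to rank resource-rich documents by topical pertinence, with no translation system involved.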