Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining

This report documents the participation of Microsoft Research India (MSR India) in the Crosslingual Information Retrieval (CLIR) evaluation organized by the Forum for Information Retrieval Evaluation 2010 (FIRE 2010). MSR India participated in two crosslingual evaluation tasks, namely the Hindi-English and Tamil-English crosslingual tasks, in addition to the English-English monolingual task. Our core CLIR engine employed a language-modeling-based approach using query-likelihood document ranking and a probabilistic translation lexicon learned from English-Hindi and English-Tamil parallel corpora. In addition, we employed two specific techniques to handle out-of-vocabulary (OOV) terms in the crosslingual runs: first, generating transliterations directly or transitively, and second, mining possible transliteration equivalents from the documents retrieved in the first pass. We show experimentally that each of these techniques significantly improved the overall retrieval performance of our crosslingual IR system. Using all of the topic, description, and narrative information, our system achieved a peak retrieval performance of 0.5133 MAP in the monolingual English-English task; in the crosslingual tasks, our systems achieved peak performances of 0.4977 MAP in Hindi-English and 0.4145 MAP in Tamil-English. Post-task analyses indicate that mining appropriate transliterations from the top results of the first-pass retrieval enhanced the overall crosslingual performance of our system, in addition to improving the performance of individual queries. Our Hindi-English crosslingual retrieval performance was nearly equal (~97%) to the English-English monolingual retrieval performance, indicating the effectiveness of our approaches to handling OOVs in enhancing the baseline performance of our CLIR system.
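The core ranking approach described above (query-likelihood document ranking combined with a probabilistic translation lexicon) can be sketched as follows. This is a minimal illustrative implementation, not the authors' actual system: the function and variable names are our own, the lexicon here is a toy stand-in for one learned from parallel corpora, and we assume Dirichlet smoothing for the document language model.

```python
import math
from collections import Counter

def clir_query_likelihood(query_terms, doc_tokens, translation_lexicon,
                          collection_lm, mu=2000.0):
    """Score log P(query | document) for a cross-lingual query.

    translation_lexicon maps a source-language term f to a distribution
    {english_term: t(e|f)} -- a probabilistic translation lexicon such as
    one learned from parallel corpora.  collection_lm holds background
    unigram probabilities P(e | collection).  Document probabilities use
    Dirichlet smoothing with parameter mu.  All names are illustrative.
    """
    tf = Counter(doc_tokens)
    dlen = len(doc_tokens)
    log_score = 0.0
    for f in query_terms:
        # Translate the query term into the document language:
        #   P(f | D) = sum_e t(e | f) * P(e | D)
        p = 0.0
        for e, t_prob in translation_lexicon.get(f, {}).items():
            p_e_doc = (tf[e] + mu * collection_lm.get(e, 1e-9)) / (dlen + mu)
            p += t_prob * p_e_doc
        if p > 0.0:
            log_score += math.log(p)
        else:
            # OOV term with no known translation: back off to a tiny
            # constant (this is where transliteration generation and
            # mining would supply candidate equivalents instead).
            log_score += math.log(1e-12)
    return log_score
```

A document containing likely translations of the query terms receives a higher score than one that does not, which is the basis for first-pass ranking; the OOV branch marks exactly the gap that the transliteration techniques in the report are designed to fill.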
