A maximum coherence model for dictionary-based cross-language information retrieval

A key challenge in cross-language information retrieval is resolving the translation ambiguity of queries efficiently, given how short queries typically are. The problem is even harder when only bilingual dictionaries are available, which is the setting this paper addresses. Previous research on dictionary-based cross-language information retrieval has used word co-occurrence statistics to determine the most likely translations of query words. In this paper, we propose a novel statistical model, named the "maximum coherence model", which estimates translation probabilities for query words that are consistent with the word co-occurrence statistics. Unlike previous work, where a binary decision is made when selecting translations, the new model retains the uncertainty in translating a query word when its sense ambiguity is difficult to resolve. Furthermore, the new model estimates the translations of multiple query words simultaneously, in contrast to many previous approaches that determine the translation of each query word independently. Empirical studies with TREC datasets show that the maximum coherence model achieves a relative improvement of 10% to 40% in cross-language information retrieval compared to other approaches that also use word co-occurrence statistics for sense disambiguation.
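To make the idea of joint, probabilistic translation selection concrete, here is a minimal sketch in Python. It is not the paper's formulation or optimization procedure: it assumes a coherence score that sums, over pairs of query words, the product of their translation probabilities weighted by a co-occurrence similarity, and it swaps in a simple temperature-controlled softmax update to keep the probabilities soft. The function name, the `temperature` parameter, the toy English-to-German dictionary, and the similarity values are all hypothetical.

```python
import math


def coherence_translation_probs(candidates, sim, temperature=0.5, iterations=50):
    """Estimate soft translation probabilities for all query words jointly.

    candidates: dict mapping each query word to its dictionary translations.
    sim: dict mapping frozenset({t, u}) to a co-occurrence similarity score.
    This is an illustrative softmax update, not the authors' algorithm.
    """
    # Start from a uniform distribution over each word's candidate translations.
    probs = {w: {t: 1.0 / len(ts) for t in ts} for w, ts in candidates.items()}

    for _ in range(iterations):
        new_probs = {}
        for word, translations in candidates.items():
            scores = {}
            for t in translations:
                # Expected similarity between t and the translations of the
                # other query words, under their current probabilities.
                support = 0.0
                for other, other_translations in candidates.items():
                    if other == word:
                        continue
                    for u in other_translations:
                        support += probs[other][u] * sim.get(frozenset((t, u)), 0.0)
                scores[t] = support

            # Softmax keeps every candidate's probability strictly positive,
            # so ambiguity that the statistics cannot resolve is retained.
            top = max(scores.values())
            weights = {t: math.exp((s - top) / temperature) for t, s in scores.items()}
            total = sum(weights.values())
            new_probs[word] = {t: wgt / total for t, wgt in weights.items()}
        probs = new_probs
    return probs


# Toy English-to-German dictionary (hypothetical data).
candidates = {
    "bank": ["Bank", "Ufer"],          # financial institution vs. riverbank
    "interest": ["Zins", "Interesse"],  # interest rate vs. curiosity
}
# Made-up co-occurrence similarities between target-language words.
sim = {
    frozenset(("Bank", "Zins")): 0.9,
    frozenset(("Bank", "Interesse")): 0.2,
    frozenset(("Ufer", "Zins")): 0.1,
    frozenset(("Ufer", "Interesse")): 0.1,
}

probs = coherence_translation_probs(candidates, sim)
for word, dist in probs.items():
    print(word, {t: round(p, 3) for t, p in dist.items()})
```

In this toy setup the financial readings of both words reinforce each other, while the alternative translations keep non-zero probability instead of being discarded by a binary decision. Raising the assumed `temperature` keeps the distributions closer to uniform; lowering it approaches a hard selection.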
