Combining lexical and statistical translation evidence for cross‐language information retrieval

This article explores how best to combine lexical and statistical translation evidence for cross‐language information retrieval (CLIR). Lexical translation evidence is assembled from Wikipedia and from a large machine‐readable dictionary, and statistical translation evidence is drawn from parallel corpora. Evidence from co‐occurrence in the document language provides a basis for limiting the adverse effect of translation ambiguity. Coverage statistics for NII Testbeds and Community for Information Access Research (NTCIR) queries confirm that these resources have complementary strengths. Experiments with translation evidence from a small parallel corpus indicate that even rough estimates of translation probabilities can improve on a strong translation‐weighting technique that uses Jensen–Shannon divergence as a term‐association measure. Finally, a novel approach to posttranslation query expansion, based on a random walk over the Wikipedia concept link graph, is shown to outperform alternative posttranslation query expansion techniques. Evaluation results on the NTCIR‐5 English–Korean test collection show statistically significant improvements over strong baselines.
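
To make the term‐association idea concrete, the following is a minimal sketch of how Jensen–Shannon divergence might be used to compare two candidate translations by their co‐occurrence contexts in the document language. The abstract does not give the article's exact formulation, so everything here is an assumption made for illustration: the function names, the document‐level co‐occurrence window, and the toy corpus with its placeholder tokens (`bank_fin`, `bank_river`, `interest_rate`) are all hypothetical. A lower divergence is read as a stronger association.

```python
import math
from collections import Counter


def context_distribution(term, documents):
    """Relative frequencies of words that share a document with `term`.

    Assumption: the whole document is used as the co-occurrence window.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        if term in tokens:
            counts.update(t for t in tokens if t != term)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def kl_divergence(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; m must cover p's support."""
    return sum(pv * math.log2(pv / m[w]) for w, pv in p.items() if pv > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical document-language corpus with two senses of "bank".
toy_documents = [
    "bank_fin loan interest_rate deposit",
    "interest_rate loan deposit bank_fin",
    "bank_river water fishing",
    "bank_river fishing boat",
]

if __name__ == "__main__":
    anchor = context_distribution("interest_rate", toy_documents)
    for candidate in ("bank_fin", "bank_river"):
        ctx = context_distribution(candidate, toy_documents)
        print(candidate, jensen_shannon_divergence(ctx, anchor))
```

In this toy setting the financial sense shares context words with `interest_rate` and therefore receives the lower divergence, which is the kind of signal a translation‐weighting scheme could use to favor one candidate over another.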
