Multilingual information retrieval in the language modeling framework

Multilingual information retrieval (MLIR) provides results that are more comprehensive than those of mono- and cross-lingual retrieval. Methods for MLIR fall into two categories: (1) fusion-based methods, which merge the results of multiple retrieval runs, and (2) direct methods, which build a single index for the entire collection. Merging the results of individual runs reduces overall effectiveness, while the more effective direct methods suffer from either high time complexity and memory overhead or over-weighting of index terms. In this paper, we propose a direct MLIR approach within the language modeling framework that includes a novel multilingual language model estimation for documents and a new way to estimate word statistics globally. These contributions enable documents in multiple languages to be ranked in a single retrieval phase without the problems of previous direct methods. Moreover, our approach accommodates multilingual feedback information, which helps prevent query drift and consequently improves performance. Finally, our proposed estimation methods effectively address the common case of incomplete coverage of translation resources. Experimental results show that the proposed approach outperforms previous MLIR approaches.
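To make the idea of direct ranking concrete, the following minimal sketch scores documents in several languages in one pass, using a generic translation-based query likelihood with Dirichlet smoothing over background statistics pooled across the whole collection. It is not the estimation proposed in the paper. The translation table, the toy documents, the value of `mu`, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical translation weights t(q | w): how strongly a document term w
# (in any language) supports the English query term q. Illustrative values only.
TRANSLATIONS = {
    "house": {"house": 1.0, "maison": 0.9, "haus": 0.9},
    "price": {"price": 1.0, "prix": 0.8, "preis": 0.8},
}

# Toy multilingual collection held in a single structure (document id -> tokens).
DOCS = {
    "en1": "house price rises again".split(),
    "fr1": "le prix de la maison augmente".split(),
    "de1": "das wetter ist heute gut".split(),
}


def collection_model(docs):
    """Relative term frequencies pooled over all documents and languages."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def score(query, tokens, background, mu=10.0, floor=1e-9):
    """Log query likelihood of one document under Dirichlet smoothing."""
    tf = Counter(tokens)
    log_prob = 0.0
    for q in query:
        # Unseen query terms fall back to an identity translation (assumption).
        table = TRANSLATIONS.get(q, {q: 1.0})
        # Translated term count in the document and translated background mass.
        doc_mass = sum(t * tf[w] for w, t in table.items())
        bg_mass = sum(t * background.get(w, 0.0) for w, t in table.items())
        bg_mass = max(bg_mass, floor)  # avoid log(0) when nothing matches
        log_prob += math.log((doc_mass + mu * bg_mass) / (len(tokens) + mu))
    return log_prob


def rank(query, docs):
    """Rank all documents, regardless of language, in a single retrieval pass."""
    background = collection_model(docs)
    scored = [(doc_id, score(query, tokens, background))
              for doc_id, tokens in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, log_prob in rank(["house", "price"], DOCS):
        print(doc_id, round(log_prob, 3))
```

Because every document is scored against the same query model and the same background statistics, the scores are directly comparable across languages and no separate merging step is needed, which is the property that distinguishes direct methods from fusion-based ones.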
