Exploiting Representations from Statistical Machine Translation for Cross-Language Information Retrieval

This work explores how internal representations of modern statistical machine translation systems can be exploited for cross-language information retrieval. We tackle two core issues that are central to query translation: how to exploit context to generate more accurate translations and how to preserve ambiguity that may be present in the original query, thereby retaining a diverse set of translation alternatives. These two considerations are often in tension, since ambiguity in natural language is typically resolved by exploiting context, but effective retrieval requires striking the right balance. We propose two novel query translation approaches: the grammar-based approach extracts translation probabilities from translation grammars, while the decoder-based approach takes advantage of n-best translation hypotheses. Both are context-sensitive, in contrast to a baseline context-insensitive approach that uses bilingual dictionaries for word-by-word translation. Experimental results show that by “opening up” modern statistical machine translation systems, we can access intermediate representations that yield high retrieval effectiveness. By combining evidence from multiple sources, we demonstrate significant improvements over competitive baselines on standard cross-language information retrieval test collections. In addition to effectiveness, we also explore the efficiency of our techniques.
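
As a rough illustration of the decoder-based idea and of combining evidence from multiple sources, the sketch below turns a weighted n-best list into per-term translation distributions and then mixes one such distribution with a dictionary-based one. The abstract does not specify how this is done, so everything here is an assumption made for illustration: the function names, the input format (each hypothesis arrives with a weight and word-aligned source/target pairs), and the use of linear interpolation to combine sources.

```python
from collections import defaultdict


def nbest_translation_probs(nbest):
    """Estimate Pr(target | source) from a weighted n-best list.

    nbest: list of (weight, aligned_pairs), where aligned_pairs is a list of
    (source_term, target_term) tuples. This input format is hypothetical.
    """
    total = sum(weight for weight, _ in nbest)
    mass = defaultdict(lambda: defaultdict(float))
    for weight, aligned_pairs in nbest:
        for src, tgt in aligned_pairs:
            # Each hypothesis votes for its aligned pairs in proportion
            # to its normalized weight.
            mass[src][tgt] += weight / total
    probs = {}
    for src, targets in mass.items():
        norm = sum(targets.values())
        probs[src] = {tgt: m / norm for tgt, m in targets.items()}
    return probs


def interpolate(distributions, weights):
    """Linearly interpolate several distributions over target terms.

    The weights are assumed to be non-negative and to sum to 1.
    """
    combined = defaultdict(float)
    for dist, w in zip(distributions, weights):
        for tgt, p in dist.items():
            combined[tgt] += w * p
    return dict(combined)


# Toy English-to-French query "bank loan" with two decoder hypotheses.
nbest = [
    (0.6, [("bank", "banque"), ("loan", "emprunt")]),
    (0.4, [("bank", "rive"), ("loan", "emprunt")]),
]
decoder_probs = nbest_translation_probs(nbest)

# Context-insensitive dictionary distribution for the same source term.
dictionary_probs = {"banque": 0.5, "rive": 0.3, "banc": 0.2}
mixed = interpolate([decoder_probs["bank"], dictionary_probs], [0.5, 0.5])
```

In this toy setup the decoder evidence concentrates probability on the alternatives that fit the query context, while the dictionary distribution keeps a less likely alternative ("banc") in play, which mirrors the balance between context and preserved ambiguity described above.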
