Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

We present an approach to learning bilingual n-gram correspondences from relevance rankings of English documents for Japanese queries. We show that directly optimizing cross-lingual rankings rivals and complements machine translation-based cross-language information retrieval (CLIR). We propose an efficient boosting algorithm that deals with very large cross-product spaces of word correspondences. We show in an experimental evaluation on patent prior art search that our approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance. Our training and test data are made publicly available.
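
The idea of learning word correspondences directly from relevance rankings can be illustrated with a minimal sketch. It is not the boosting algorithm proposed here: it swaps in a simple pairwise perceptron-style update, and every name in it (`pair_features`, `score`, `pairwise_update`, the learning rate `lr`, the toy terms) is a hypothetical chosen for illustration. It assumes a query and a document are given as lists of terms, and that features are the cross product of query terms and document terms.

```python
from collections import defaultdict
from itertools import product


def pair_features(query_terms, doc_terms):
    """Count every (query term, document term) pair in the cross product."""
    feats = defaultdict(float)
    for q, d in product(query_terms, doc_terms):
        feats[(q, d)] += 1.0
    return feats


def score(weights, query_terms, doc_terms):
    """Sum the learned weights of all word pairs shared by query and document."""
    feats = pair_features(query_terms, doc_terms)
    return sum(weights.get(pair, 0.0) * count for pair, count in feats.items())


def pairwise_update(weights, query_terms, rel_doc, irr_doc, lr=0.1):
    """Perceptron-style step (an assumption, not the paper's boosting update).

    If the relevant document does not outscore the irrelevant one, raise the
    weights of its word pairs and lower those of the irrelevant document.
    Returns True when an update was made.
    """
    if score(weights, query_terms, rel_doc) > score(weights, query_terms, irr_doc):
        return False
    for pair, count in pair_features(query_terms, rel_doc).items():
        weights[pair] = weights.get(pair, 0.0) + lr * count
    for pair, count in pair_features(query_terms, irr_doc).items():
        weights[pair] = weights.get(pair, 0.0) - lr * count
    return True


# Toy data: a Japanese query with one relevant and one irrelevant English document.
weights = {}
query = ["特許", "検索"]
relevant = ["patent", "search"]
irrelevant = ["weather", "report"]
pairwise_update(weights, query, relevant, irrelevant)
```

The weight table is kept sparse (a plain dictionary), since the full cross-product space of word pairs is very large and only pairs observed in training data ever receive a weight.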
