Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Service Context

This work proposes adapting an existing general SMT model to the task of translating queries that are subsequently used to retrieve information from a target-language collection. In the scenario we focus on, the document collection itself is not accessible and the IR model cannot be changed. We propose two ways to achieve the adaptation effect, both of which tune the parameter weights on a set of parallel queries. The first is a standard tuning procedure that optimizes for BLEU score; the second is a reranking approach that optimizes for MAP score. We also extend the second approach with syntax-based features. Our experiments show improvements of 1-2.5 in terms of MAP score over retrieval with the non-adapted translation. We show that these improvements are due both to the adaptation and to the syntax-based features for the query translation task.
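
For readers unfamiliar with the retrieval metric named above, the following is a minimal sketch of how MAP (mean average precision) is conventionally computed from ranked retrieval results. It is not the authors' evaluation code; the function names, document identifiers, and toy relevance judgments are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranked list, relevant set) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy data: two translated queries and their retrieval results.
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d2"], {"d2"}),
]
toy_map = mean_average_precision(toy_runs)
```

In a reranking setup of the kind described, a score like this would be computed for the retrieval results of each candidate translation, so that the tuning objective rewards translations that retrieve well rather than translations that merely match a reference wording.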
