Adaptation of machine translation for multilingual information retrieval in the medical domain

OBJECTIVE We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques for adapting MT to increase translation quality; we also explore MT adaptation to improve the effectiveness of cross-lingual IR.

METHODS AND DATA Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of the phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project, and IR effectiveness is tested on the CLEF eHealth 2013 data sets.

RESULTS The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. BLEU scores increased over the baseline from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English. This is a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results.

CONCLUSIONS Most of the MT techniques employed in our experiments improve MT of medical search queries. Intelligent training data selection in particular proves to be very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source-language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features and provide future research directions.
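The abstract names BM25 as the retrieval model and multiple translation variants as a means of query expansion, but gives no formula. As an illustration only, the following is a minimal sketch of the standard Okapi BM25 scoring function applied to a toy collection. It is not the Lucene implementation used in the experiments, which differs in detail; the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the example documents, and the "expanded" query are all assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Textbook Okapi BM25 with a non-negative IDF variant; k1 and b are
    assumed defaults, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, whitespace-tokenized.
docs = [
    "chest pain and shortness of breath".split(),
    "headache treatment with ibuprofen".split(),
    "pain in the chest after exercise".split(),
]

# A single translation of a query versus a query expanded with an extra
# term, standing in for an additional translation variant.
single = bm25_scores(["chest", "pain"], docs)
expanded = bm25_scores(["chest", "pain", "breath"], docs)
```

In this toy setting the expanded query raises the score only of the document that contains the added term. Whether such expansion helps on a real collection is an empirical question, which is what the IR experiments summarized above address.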
