Translation of Biomedical Documents with Focus on Spanish-English

For the WMT 2018 shared task of translating documents pertaining to the Biomedical domain, we developed a scoring formula that uses an unsophisticated and effective method of weighting term frequencies and was integrated in a data selection pipeline. The method was applied on five language pairs and it performed best on Portuguese-English, where a BLEU score of 41.84 placed it third out of seven runs submitted by three institutions. In this paper, we describe our method and results with a special focus on Spanish-English where we compare it against a state-of-the-art method. Our contribution to the task lies in introducing a fast, unsupervised method for selecting domain-specific data for training models which obtain good results using only 10% of the general domain data.

[1]  Martin Porter,et al.  Snowball: A language for stemming algorithms , 2001 .

[2]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[3]  Alex Waibel,et al.  Adaptation of the translation model for statistical machine translation based on information retrieval , 2005, EAMT.

[4]  William D. Lewis,et al.  Intelligent Selection of Language Model Training Data , 2010, ACL.

[5]  Mariana L. Neves,et al.  The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine , 2016, LREC.

[6]  Ondrej Bojar,et al.  Using MT-ComparEval , 2016 .

[7]  Alexander H. Waibel,et al.  Low Cost Portability for Statistical Machine Translation based on N-gram Frequency and TF-IDF , 2005, IWSLT.

[8]  Philipp Koehn,et al.  Large and Diverse Language Models for Statistical Machine Translation , 2008, IJCNLP.

[9]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[10]  Alon Lavie,et al.  Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability , 2011, ACL.

[11]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[12]  Eleftherios Avramidis,et al.  MT-ComparEval: Graphical evaluation interface for Machine Translation development , 2015, Prague Bull. Math. Linguistics.

[13]  Karin M. Verspoor,et al.  Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets , 2018, WMT.

[14]  Jörg Tiedemann,et al.  Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.

[15]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[16]  Philipp Koehn,et al.  An Experimental Management System , 2010, Prague Bull. Math. Linguistics.

[17]  Krzysztof Marasek,et al.  Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs , 2015, ArXiv.

[18]  Ewan Klein,et al.  Natural Language Processing with Python , 2009 .

[19]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[20]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[21]  Fuchun Peng,et al.  N-GRAM-BASED AUTHOR PROFILES FOR AUTHORSHIP ATTRIBUTION , 2003 .

[22]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[23]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[24]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[25]  Jianfeng Gao,et al.  Domain Adaptation via Pseudo In-Domain Data Selection , 2011, EMNLP.

[26]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.