Paper Information - Translation of Biomedical Documents with Focus on Spanish-English


For the WMT 2018 shared task on translating documents in the biomedical domain, we developed a scoring formula that uses a simple yet effective method of weighting term frequencies and integrated it into a data selection pipeline. The method was applied to five language pairs and performed best on Portuguese-English, where a BLEU score of 41.84 placed it third out of seven runs submitted by three institutions. In this paper, we describe our method and results, with a special focus on Spanish-English, where we compare it against a state-of-the-art method. Our contribution to the task is a fast, unsupervised method for selecting domain-specific training data; models trained on the selected data obtain good results using only 10% of the general-domain data.
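The abstract does not give the scoring formula itself, so the following is only a minimal sketch of the general idea of ranking general-domain sentences by weighted term frequencies and keeping the top 10%. The relative-frequency weighting, the whitespace tokenizer, and all function names (`build_term_weights`, `score_sentence`, `select_top_fraction`) are assumptions made for illustration and should not be read as the authors' method.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def build_term_weights(in_domain_sentences):
    # Weight each term by its relative frequency in the in-domain sample.
    # This is a placeholder weighting, not the formula from the paper.
    counts = Counter(tok for s in in_domain_sentences for tok in tokenize(s))
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def score_sentence(sentence, weights):
    # Average term weight, so long sentences are not favoured by length alone.
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def select_top_fraction(general_sentences, weights, fraction=0.10):
    # Rank general-domain sentences by score and keep the best `fraction`.
    ranked = sorted(
        general_sentences,
        key=lambda s: score_sentence(s, weights),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

With a handful of in-domain sentences as the seed, `select_top_fraction(general, build_term_weights(in_domain))` returns the general-domain sentences that look most like the seed data. In a parallel-data setting the same ranking would be applied to sentence pairs rather than single sentences.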

Wolfgang Menzel | Mirela-Stefania Duma
