Using Term Position Similarity and Language Modeling for Bilingual Document Alignment

The WMT Bilingual Document Alignment Task requires systems to match source pages with their translations within a large space of candidate pairs. We present four methods. The first uses the term position similarity between candidate document pairs. The second requires automatically translated versions of the target text and matches them against the candidates. The third and fourth methods try to overcome some of the challenges posed by the nature of the corpus, by considering the string similarity between the source URL and the candidate URL and by combining the first two approaches.
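As a rough illustration of the URL-based idea, the sketch below ranks candidate URLs by their string similarity to a source URL. The abstract does not say which similarity measure is used, so the ratio from `difflib.SequenceMatcher` in Python's standard library serves here as an assumed stand-in; the function names `url_similarity` and `best_candidate` and the example URLs are hypothetical.

```python
from difflib import SequenceMatcher


def url_similarity(source_url: str, candidate_url: str) -> float:
    """Return a similarity score in [0, 1] between two URL strings.

    Assumption: SequenceMatcher's ratio is used only as an illustrative
    measure; the similarity used in the paper may differ.
    """
    return SequenceMatcher(None, source_url, candidate_url).ratio()


def best_candidate(source_url: str, candidate_urls: list[str]) -> str:
    """Pick the candidate URL most similar to the source URL."""
    return max(candidate_urls, key=lambda c: url_similarity(source_url, c))


if __name__ == "__main__":
    # Hypothetical URLs that differ mainly in a language code.
    source = "http://example.com/en/about.html"
    candidates = [
        "http://example.com/fr/about.html",
        "http://example.com/fr/contact.html",
    ]
    print(best_candidate(source, candidates))
```

A measure of this kind favors pages whose URLs differ only in a language marker, which is one plausible reason URL similarity can help on web-crawled corpora.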
