Paper information - First Steps Towards Coverage-Based Document Alignment


In this paper we describe a method for selecting pairs of parallel documents (documents that are translations of each other) from a large collection of documents obtained from the web. Our approach is based on a coverage score that reflects the number of distinct bilingual phrase pairs found in each pair of documents, normalized by the total number of unique phrases found in them. Since parallel documents tend to share more bilingual phrase pairs than non-parallel documents, our alignment algorithm selects the pairs of documents whose coverage score is the maximum among all possible pairings involving either one of the two documents.
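The scoring and selection idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the bilingual lexicon is a plain set of `(source_phrase, target_phrase)` tuples, phrases are lowercased whitespace n-grams, the normalizer is the count of unique lexicon phrases found on both sides, and the names `ngrams`, `coverage`, and `align` are hypothetical.

```python
def ngrams(text, max_n=2):
    """Return the set of word n-grams (n = 1..max_n) in a lowercased text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def coverage(src_text, tgt_text, lexicon, max_n=2):
    """Assumed score: matched bilingual pairs / unique lexicon phrases found."""
    src_vocab = {s for s, _ in lexicon}
    tgt_vocab = {t for _, t in lexicon}
    # Keep only phrases that the lexicon knows about on each side.
    src_phrases = ngrams(src_text, max_n) & src_vocab
    tgt_phrases = ngrams(tgt_text, max_n) & tgt_vocab
    total = len(src_phrases) + len(tgt_phrases)
    if total == 0:
        return 0.0
    # A pair is matched when both of its sides occur in the two documents.
    matched = sum(1 for s, t in lexicon if s in src_phrases and t in tgt_phrases)
    return matched / total


def align(src_docs, tgt_docs, lexicon):
    """Keep (s, t) when its score is the best among all pairings of s and of t."""
    scores = {
        (s, t): coverage(src_docs[s], tgt_docs[t], lexicon)
        for s in src_docs
        for t in tgt_docs
    }
    selected = []
    for (s, t), score in scores.items():
        if score == 0.0:
            continue  # no shared phrase pairs, so no evidence of parallelism
        best_for_s = max(scores[(s, other)] for other in tgt_docs)
        best_for_t = max(scores[(other, t)] for other in src_docs)
        if score == best_for_s and score == best_for_t:
            selected.append((s, t))
    return sorted(selected)


lexicon = {("house", "casa"), ("red", "vermelha"), ("cat", "gato"), ("black", "preto")}
src_docs = {"en1": "the red house", "en2": "a black cat"}
tgt_docs = {"pt1": "a casa vermelha", "pt2": "o gato preto"}
print(align(src_docs, tgt_docs, lexicon))
```

This sketch scores every source/target pairing, which is quadratic in the number of documents, and it keeps all tied pairs. A real web-scale collection would likely need candidate filtering before scoring.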

José Gabriel Pereira Lopes | Luís Gomes | Luís Manuel dos Santos Gomes | J. Lopes

[1] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.

[2] Rico Sennrich, et al. MT-based Sentence Alignment for OCR-generated Parallel Texts, 2010, AMTA.

[3] José Gabriel Pereira Lopes, et al. First Steps Towards Coverage-Based Sentence Alignment, 2016, LREC.

[4] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[5] Mikel L. Forcada, et al. Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor, 2010, Prague Bull. Math. Linguistics.