Improved Unsupervised Sentence Alignment for Symmetrical and Asymmetrical Parallel Corpora

We address the problem of unsupervised, language-pair-independent sentence alignment of symmetrical and asymmetrical parallel corpora. Asymmetrical parallel corpora contain a large proportion of 1-to-0/0-to-1 and 1-to-many/many-to-1 sentence correspondences. We present a novel approach that is fast and achieves high alignment accuracy, measured by F1, on both asymmetrical and symmetrical parallel corpora. The source code of our aligner and the test sets are freely available.
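To make the correspondence types and the F1 measure concrete, the following is a minimal sketch and not the aligner itself. It assumes an alignment is a set of beads, each pairing a tuple of source sentence indices with a tuple of target sentence indices, and that a predicted bead counts as correct only if it matches a reference bead exactly. The names `bead_type` and `alignment_f1` are hypothetical, and whether 1-to-0/0-to-1 beads are scored is an evaluation choice that the abstract does not specify.

```python
from typing import Iterable, Tuple

# A bead pairs source sentence indices with target sentence indices.
# An empty tuple on one side encodes a 1-to-0 or 0-to-1 correspondence.
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]


def bead_type(bead: Bead) -> str:
    """Return the correspondence type of a bead, e.g. '1-to-0' or '2-to-1'."""
    src, tgt = bead
    return f"{len(src)}-to-{len(tgt)}"


def alignment_f1(predicted: Iterable[Bead], reference: Iterable[Bead]) -> float:
    """F1 over exactly matching beads (assumed scoring scheme)."""
    pred, ref = set(predicted), set(reference)
    correct = len(pred & ref)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy asymmetrical example: one deletion, one 2-to-1, one 1-to-2 bead.
reference = [((0,), (0,)), ((1,), ()), ((2, 3), (1,)), ((4,), (2, 3))]
# The prediction merges the deleted sentence into the following bead.
predicted = [((0,), (0,)), ((1, 2, 3), (1,)), ((4,), (2, 3))]

print([bead_type(b) for b in reference])
print(round(alignment_f1(predicted, reference), 4))
```

In this toy case two of the three predicted beads are correct and two of the four reference beads are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.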
