Parallel corpora for medium density languages

The choice of natural language technology appropriate for a given language depends heavily on its density, that is, the availability of digitally stored material in that language. More than half of the world speaks medium density languages, yet many of the methods appropriate for high or low density languages yield suboptimal results when applied to the medium density case. In this paper we describe a general methodology for rapidly collecting, building, and aligning parallel corpora for medium density languages, illustrating our main points with Hungarian, Romanian, and Slovenian. We also describe and evaluate the hybrid sentence alignment method we use.
