Sentence Alignment of Hungarian-English Parallel Corpora Using a Hybrid Algorithm

We present an efficient hybrid method for aligning sentences with their translations in a parallel bilingual corpus. The new algorithm combines a length-based method with an anchor-matching method that uses Named Entity recognition, pairing the speed of length-based models with the accuracy of anchor-finding methods. Because cognates can be found only with extremely low accuracy for the Hungarian-English language pair, we use Named Entities as anchors instead. Owing to its well-selected anchors, the algorithm was found to outperform the two best sentence alignment algorithms published so far for the Hungarian-English language pair.
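To make the idea of a hybrid aligner concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dynamic-programming search over 1-1, 1-0, 0-1, 2-1 and 1-2 sentence groupings, a cost based on character-length mismatch, and a bonus for shared anchors. The function names, the `BEADS` penalties and the `ANCHOR_BONUS` weight are hypothetical values chosen for illustration, and the `anchors` function is only a crude stand-in for Named Entity recognition (numbers and capitalized tokens that are not sentence-initial).

```python
import re

# Hypothetical weights, chosen only for illustration.
ANCHOR_BONUS = 0.5
BEADS = {(1, 1): 0.0, (1, 0): 0.5, (0, 1): 0.5, (2, 1): 0.3, (1, 2): 0.3}


def anchors(text):
    """Crude stand-in for Named Entity recognition (an assumption):
    numbers and capitalized tokens that are not the first token."""
    tokens = re.findall(r"\w+", text)
    found = set()
    for i, tok in enumerate(tokens):
        if tok.isdigit() or (i > 0 and tok[0].isupper()):
            found.add(tok.lower())
    return found


def match_cost(src, tgt):
    """Length mismatch in [0, 1], lowered by each anchor both sides share."""
    length_cost = abs(len(src) - len(tgt)) / max(len(src), len(tgt), 1)
    shared = len(anchors(src) & anchors(tgt))
    return length_cost - ANCHOR_BONUS * shared


def align(src_sents, tgt_sents):
    """Return the cheapest alignment as (source indices, target indices) pairs."""
    n, m = len(src_sents), len(tgt_sents)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell by each allowed grouping.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src = " ".join(src_sents[i:ni])
                tgt = " ".join(tgt_sents[j:nj])
                c = cost[i][j] + penalty + match_cost(src, tgt)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Backward pass: recover the groupings from the stored steps.
    result = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        result.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    result.reverse()
    return result
```

Calling `align(hu_sentences, en_sentences)` on two lists of sentence strings returns index groups such as `([0], [0])` for a one-to-one match. In this sketch the length term does most of the work, while shared anchors pull the search toward pairs that mention the same names or numbers; a real system would replace `anchors` with a trained Named Entity recognizer and would need to handle Hungarian inflection of names.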
