Lexical token alignment: experiments, results and applications

Lexical alignment is one of the most challenging tasks in processing and exploiting parallel texts. Numerous applications may benefit from an accurate multilingual lexical alignment of bi- and multi-language corpora. This paper describes a hypothesis-testing approach to the automatic extraction of translation equivalents from sentence-aligned and tagged parallel corpora. The algorithm was used to automatically extract six bilingual lexicons, with English as the source language and Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene as the target languages, as well as a seven-language lexicon with English as a hub and the other six Central and Eastern European (CEE) languages. The experiments described here used the seven-language aligned corpus based on Orwell’s novel “1984”.
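As a rough illustration of what a hypothesis-testing extraction step over sentence-aligned, tagged text might look like, the Python sketch below generates candidate pairs from aligned sentences and keeps those whose co-occurrence is unlikely under independence. The abstract does not specify the test statistic, the candidate filter or the threshold, so everything here is an assumption: the log-likelihood ratio score, the same-part-of-speech restriction, the cutoff of 3.84 (the chi-square critical value at p = 0.05 with one degree of freedom), and the names `log_likelihood`, `extract_equivalents` and `toy_corpus`, which are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    def h(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)

    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))


def extract_equivalents(aligned_pairs, threshold=3.84):
    """Score candidate translation pairs from sentence-aligned, tagged text.

    aligned_pairs: list of (source_tokens, target_tokens); each token is a
    (lemma, pos) tuple. Candidates are restricted to pairs with the same
    POS tag (an assumption made for this sketch).
    """
    pair_counts, src_counts, tgt_counts = Counter(), Counter(), Counter()
    n = 0
    for src, tgt in aligned_pairs:
        n += 1
        src_set, tgt_set = set(src), set(tgt)
        src_counts.update(src_set)
        tgt_counts.update(tgt_set)
        for s, t in product(src_set, tgt_set):
            if s[1] == t[1]:
                pair_counts[(s, t)] += 1

    lexicon = []
    for (s, t), k11 in pair_counts.items():
        k12 = src_counts[s] - k11   # source word without the target word
        k21 = tgt_counts[t] - k11   # target word without the source word
        k22 = n - k11 - k12 - k21   # sentence pairs containing neither
        if k11 * k22 <= k12 * k21:  # keep positive associations only
            continue
        score = log_likelihood(k11, k12, k21, k22)
        if score >= threshold:      # reject the independence hypothesis
            lexicon.append((s[0], t[0], score))
    return sorted(lexicon, key=lambda item: -item[2])


# Tiny made-up English-Romanian example, for illustration only.
toy_corpus = [
    ([("love", "N"), ("be", "V")], [("iubire", "N"), ("fi", "V")]),
    ([("peace", "N"), ("be", "V")], [("pace", "N"), ("fi", "V")]),
    ([("love", "N"), ("come", "V")], [("iubire", "N"), ("veni", "V")]),
    ([("peace", "N"), ("come", "V")], [("pace", "N"), ("veni", "V")]),
]
lexicon = extract_equivalents(toy_corpus)
```

Counting each word at most once per sentence pair keeps the contingency table well defined over sentence pairs. A real system would need far more data than this toy corpus and would typically add further filters, which this sketch leaves out.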
