Extracting Multilingual Lexicons from Parallel

This paper describes our recent work on the automatic extraction of translation equivalents from parallel corpora. We present three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. The baseline is described mainly for illustrative purposes, while the non-iterative algorithms show how different working hypotheses can be adopted, motivated by the kind of application and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual preservation of part of speech (POS); the third does not require POS invariance as an extraction condition. The algorithms were evaluated on three different corpora and several language pairs.
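To make the POS-preservation hypothesis concrete, the following is a minimal sketch and not one of the three algorithms described in the paper. It assumes sentence-aligned input in which every token is a (lemma, POS) pair, and it scores candidate pairs with the Dice coefficient, which is a stand-in chosen for brevity. The function name `extract_candidates`, the parameters `require_same_pos` and `min_score`, and the toy English-Romanian data are all hypothetical.

```python
from collections import Counter
from itertools import product


def extract_candidates(aligned_pairs, require_same_pos=True, min_score=0.5):
    """Pick, for each source token type, its best-scoring target candidate.

    aligned_pairs: list of (source_sentence, target_sentence), where each
    sentence is a list of (lemma, pos) tuples.
    Returns a dict mapping (src_lemma, pos) -> ((tgt_lemma, pos), score).
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        # Count each token type once per aligned sentence pair.
        src_types, tgt_types = set(src_sent), set(tgt_sent)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        for s, t in product(src_types, tgt_types):
            # POS preservation: only pair tokens that share a POS tag.
            if require_same_pos and s[1] != t[1]:
                continue
            pair_freq[(s, t)] += 1

    best = {}
    for (s, t), n in pair_freq.items():
        # Dice coefficient over sentence-level co-occurrence counts.
        score = 2 * n / (src_freq[s] + tgt_freq[t])
        if score >= min_score and (s not in best or score > best[s][1]):
            best[s] = (t, score)
    return best


# Toy sentence-aligned, lemmatized and tagged data (illustrative only).
toy_corpus = [
    ([("house", "N"), ("be", "V"), ("big", "A")],
     [("casă", "N"), ("fi", "V"), ("mare", "A")]),
    ([("house", "N"), ("be", "V"), ("old", "A")],
     [("casă", "N"), ("fi", "V"), ("vechi", "A")]),
    ([("dog", "N"), ("be", "V"), ("big", "A")],
     [("câine", "N"), ("fi", "V"), ("mare", "A")]),
]

lexicon = extract_candidates(toy_corpus)
```

Calling the same function with `require_same_pos=False` drops the POS filter, so candidates of any category compete. That mirrors, in a very simplified way, the working hypothesis under which POS invariance is not an extraction condition.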
