Extracting Multilingual Lexicons from Parallel Corpora

This paper describes our recent work on the automatic extraction of translation equivalents from parallel corpora. We present three increasingly complex algorithms: a simple iterative baseline method and two more elaborate non-iterative versions. The baseline algorithm is described mainly for illustrative purposes, whereas the non-iterative algorithms show how different working hypotheses can be used, hypotheses that may be motivated by the kind of application and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual POS preservation, that is, they assume that a word and its translation share the same part of speech; in the third algorithm POS invariance is not a condition for extraction. The algorithms were evaluated on three different corpora and several language pairs.
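To make the POS-preservation hypothesis concrete, the following minimal sketch counts candidate translation pairs in sentence-aligned, lemmatized and POS-tagged text, with and without the same-POS restriction. It is an illustration under stated assumptions, not the algorithms described in the paper: the function name `candidate_pairs`, the tag labels, the toy English-Romanian data, and the use of raw co-occurrence counts are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product


def candidate_pairs(aligned_sentences, require_same_pos=True):
    """Count co-occurrences of (source lemma, target lemma) pairs.

    aligned_sentences: iterable of (source_tokens, target_tokens), where each
    token is a (lemma, pos) tuple. Hypothetical input format for this sketch.
    """
    counts = Counter()
    for src_tokens, tgt_tokens in aligned_sentences:
        # Sets ensure a pair is counted once per aligned sentence pair.
        for (s_lemma, s_pos), (t_lemma, t_pos) in product(set(src_tokens), set(tgt_tokens)):
            # POS preservation: skip pairs whose parts of speech differ.
            if require_same_pos and s_pos != t_pos:
                continue
            counts[(s_lemma, t_lemma)] += 1
    return counts


# Toy sentence-aligned data (illustrative only).
corpus = [
    ([("house", "NOUN"), ("be", "VERB"), ("big", "ADJ")],
     [("casă", "NOUN"), ("fi", "VERB"), ("mare", "ADJ")]),
    ([("house", "NOUN"), ("burn", "VERB")],
     [("casă", "NOUN"), ("arde", "VERB")]),
]

with_pos = candidate_pairs(corpus, require_same_pos=True)
without_pos = candidate_pairs(corpus, require_same_pos=False)
```

On this toy input the same-POS restriction shrinks the candidate space from twelve pairs to four, which shows why such a hypothesis can be attractive when it holds, and why dropping it, as the third algorithm does, enlarges the search space.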
