Computational bilingual lexicography: automatic extraction of translation dictionaries

The paper describes a simple but very effective approach to extracting translation equivalents from parallel corpora. We briefly present the multilingual parallel corpus used in our experiments and then describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall and processing time. The baseline algorithm was used to extract six bilingual lexicons and was evaluated on four of them. The second algorithm was evaluated only on the Romanian-English noun lexicon. An analysis of the missed or wrong translation equivalents revealed various factors, both intrinsic (due to the method) and extrinsic (due to the working data: accuracy of the pre-processing, quality of translation, relatedness of the bitext languages). We conclude by discussing the merits and drawbacks of our method in comparison with other work and by commenting on further developments.
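
As an illustration of the evaluation measures mentioned above, the following minimal sketch computes precision and recall for an extracted lexicon against a reference lexicon. It assumes the standard set-based definitions (correct pairs over extracted pairs, and correct pairs over reference pairs); the paper's exact definitions may differ. The function name `evaluate_lexicon` and the toy Romanian-English pairs are hypothetical and are not taken from the paper.

```python
def evaluate_lexicon(extracted, gold):
    """Return (precision, recall) of extracted translation pairs.

    Assumes the standard set-based definitions; this is an
    illustrative sketch, not the paper's evaluation procedure.
    """
    extracted = set(extracted)
    gold = set(gold)
    correct = extracted & gold  # pairs that are both extracted and in the reference
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data invented for illustration (hypothetical, not from the paper).
extracted = [("casă", "house"), ("carte", "book"), ("drum", "drum")]
gold = [("casă", "house"), ("carte", "book"), ("drum", "road"), ("apă", "water")]

precision, recall = evaluate_lexicon(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three extracted pairs are correct and two of the four reference pairs are found, so a wrong equivalent lowers precision while a missed equivalent lowers recall.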
