A cost-effective lexical acquisition process for large-scale thesaurus translation

Thesauri and controlled vocabularies facilitate access to digital collections by explicitly representing the underlying principles of organization. Translating such resources into multiple languages is an important part of providing multilingual access. However, the specificity of vocabulary terms in most thesauri precludes fully automatic translation using general-domain lexical resources. In this paper, we present an efficient process for leveraging human translations to construct domain-specific lexical resources. We illustrate this process on a thesaurus of 56,000 concepts used to catalog a large archive of oral histories. We elicited human translations for a small subset of concepts, induced a probabilistic phrase dictionary from these translations, and used the resulting resource to automatically translate the rest of the thesaurus. Two separate evaluations demonstrate the acceptability of the automatic translations and the cost-effectiveness of our approach.
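The three-step process described above (elicit a few human translations, induce a probabilistic phrase dictionary, translate the remaining terms) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes phrase pairs have already been aligned, estimates probabilities by simple relative frequency, and translates by greedy longest-match lookup with no reordering. The function names `induce_phrase_dictionary` and `translate_term`, the `max_len` parameter, and the English-Spanish toy pairs are all hypothetical.

```python
from collections import Counter, defaultdict


def induce_phrase_dictionary(aligned_pairs):
    """Estimate P(target | source) by relative frequency.

    `aligned_pairs` is an iterable of (source_phrase, target_phrase)
    tuples, assumed here to come from the human-translated subset.
    """
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    return {
        source: {t: c / sum(targets.values()) for t, c in targets.items()}
        for source, targets in counts.items()
    }


def translate_term(term, dictionary, max_len=4):
    """Translate a term by greedy longest-match phrase lookup.

    Returns None when some part of the term is not covered by the
    dictionary, so uncovered terms can be routed back to a human.
    Target phrases are concatenated in source order (a simplification).
    """
    words = term.split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in dictionary:
                candidates = dictionary[phrase]
                output.append(max(candidates, key=candidates.get))
                i += n
                break
        else:
            return None  # no phrase starting at position i is known
    return " ".join(output)


# Toy, made-up phrase pairs standing in for elicited human translations.
pairs = [
    ("forced labor", "trabajo forzado"),
    ("forced labor", "trabajo forzado"),
    ("forced labor", "trabajos forzados"),
    ("camps", "campos"),
]
dictionary = induce_phrase_dictionary(pairs)
print(translate_term("forced labor camps", dictionary))
print(translate_term("forced labor conditions", dictionary))
```

Returning `None` for uncovered terms is one possible design choice: it keeps the automatic output separate from the terms that would still need human translation, which is where the cost trade-off in such a process arises.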
