Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval

OBJECTIVES: We present experiments on multi-language information extraction and access in the medical domain. Multilingual terminology plays a crucial role in such applications, which deal with specialized languages and specific domains.

MATERIAL AND METHODS: We propose, first, a method for enriching multilingual thesauri that extracts new terms from parallel corpora and, second, a new approach to bilingual lexicon extraction from comparable corpora that uses a bilingual thesaurus as a pivot. We illustrate their use in multi-language information retrieval (English/German) in the medical domain.

RESULTS: Our experiments show that the automatically extracted bilingual lexicons are accurate enough (85% precision for term extraction) for semi-automatically enriching mono- or bilingual thesauri such as the Unified Medical Language System (UMLS). Their use in cross-language information retrieval significantly improves retrieval performance (from 22% to 40% average precision) and clearly outperforms existing bilingual lexicon resources, both general and specialized.

CONCLUSION: We show, first, that bilingual lexicon extraction from parallel corpora in the medical domain can yield accurate, specialized lexicons that help enrich existing thesauri and, second, that bilingual lexicons extracted from comparable corpora outperform general bilingual resources for cross-language information retrieval.
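The retrieval results above are reported as average precision. The abstract does not say which variant was used, so the following is only a minimal sketch of the common non-interpolated definition, averaged over queries. The function names, document identifiers, and relevance sets are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Non-interpolated average precision for one query.

    Assumption: precision is sampled at the rank of each relevant
    document that is retrieved, then divided by the total number of
    relevant documents (retrieved or not).
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not judgments:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in judgments.items()
    )
    return total / len(judgments)


# Hypothetical toy data: two queries with made-up document identifiers.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d7"]}
judgments = {"q1": {"d1", "d3"}, "q2": {"d7"}}
score = mean_average_precision(runs, judgments)
```

Under this definition, a lexicon that moves relevant documents toward the top of the ranking raises the score even when the set of retrieved documents is unchanged.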
