Paper Information - Deriving a Bilingual Lexicon for Cross-Language Information Retrieval

In this paper we describe a systematic approach to deriving a bilingual lexicon automatically from parallel corpora. Following this approach, a lexicon was derived from the English and Dutch versions of the Agenda 21 corpus. Using the lexicon and a part of the corpus that was not used to derive it, a bilingual retrieval environment was built. Recall and precision of monolingual (Dutch) retrieval were compared with recall and precision of bilingual (Dutch-to-English) retrieval. An experiment was conducted with the help of eight naive users, who formulated queries and judged the relevance of the retrieved fragments. The experiment shows 78% precision and 51% relative recall for monolingual retrieval, against 67% precision and 82% relative recall for bilingual retrieval.
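
The two measures reported in the abstract can be illustrated with a minimal Python sketch. It assumes the common pooled definition of relative recall (relevant fragments retrieved by one system, divided by all relevant fragments found by any of the compared systems); the abstract does not state which definition the paper uses. The function names and fragment IDs are hypothetical and the numbers are made up for illustration; they are not the paper's data.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved fragments that were judged relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def relative_recall(retrieved, relevant_pool):
    """Fraction of the pooled relevant fragments that this system retrieved.

    Assumption: the pool is the union of relevant fragments found by all
    systems being compared, since the full set of relevant fragments in
    the corpus is unknown.
    """
    pool = set(relevant_pool)
    if not pool:
        return 0.0
    return len(set(retrieved) & pool) / len(pool)


# Hypothetical fragment IDs for a single query (illustrative only).
mono_retrieved = {"f1", "f2", "f3", "f4"}
bi_retrieved = {"f1", "f2", "f5", "f6", "f7", "f8"}
judged_relevant = {"f1", "f2", "f3", "f5", "f6"}  # pooled user judgments

mono_precision = precision(mono_retrieved, judged_relevant)
mono_rel_recall = relative_recall(mono_retrieved, judged_relevant)
bi_precision = precision(bi_retrieved, judged_relevant)
bi_rel_recall = relative_recall(bi_retrieved, judged_relevant)

print(f"monolingual: precision={mono_precision:.2f}, relative recall={mono_rel_recall:.2f}")
print(f"bilingual:   precision={bi_precision:.2f}, relative recall={bi_rel_recall:.2f}")
```

In this toy example the bilingual run retrieves more of the pooled relevant fragments at lower precision, which is the same direction of trade-off the abstract reports.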

Djoerd Hiemstra

[1] Djoerd Hiemstra et al. Using statistical methods to create a bilingual dictionary, 1996.

[2] Kenneth Ward Church et al. A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[3] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[4] Robert L. Mercer et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.