Mining Term Translation from Domain Restricted Comparable Corpora

Several approaches have been proposed in the literature for extracting word translations from comparable corpora, almost all of them based on the idea of context similarity. This work addresses that task for the Basque-Spanish pair in a popular science domain. Our experiments focus on three tasks: designing a method that combines some of the existing approaches; adapting this method to a popular science domain for the Basque-Spanish pair; and analyzing the performance of different approaches both for translating the contexts of the words and for computing the similarity between contexts. Finally, we evaluate the different prototypes by calculating precision at different cutoffs. The results support the validity of the hybrid method and show an improvement from using the probabilistic models we propose for computing the similarity between contexts.
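The context-similarity idea and the precision-at-cutoff evaluation mentioned above can be illustrated with a minimal sketch. It is not the method of this work: cosine similarity stands in for the probabilistic models, the hybrid combination is not modeled, and every function name, the window size, and the toy seed dictionary are assumptions made only for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count the words that co-occur with `target` within +/- `window` tokens."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def translate_vector(vec, seed_dict):
    """Map a source-language context vector into the target language.

    `seed_dict` is a hypothetical bilingual dictionary {source: [targets]};
    words missing from it are dropped.
    """
    out = Counter()
    for word, weight in vec.items():
        for translation in seed_dict.get(word, []):
            out[translation] += weight
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (a stand-in measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(source_vec, target_vecs, seed_dict):
    """Rank target-language words by similarity to the translated source context."""
    translated = translate_vector(source_vec, seed_dict)
    return sorted(target_vecs,
                  key=lambda w: cosine(translated, target_vecs[w]),
                  reverse=True)


def precision_at(rankings, gold, cutoff):
    """Fraction of source words whose gold translation is in the top `cutoff`."""
    hits = sum(1 for src, ranked in rankings.items()
               if gold[src] in ranked[:cutoff])
    return hits / len(rankings)


# Toy data (assumed, not taken from the corpora used in this work).
eu_tokens = ["landare", "ur", "argi", "landare", "ur"]
seed = {"ur": ["agua"], "argi": ["luz"]}
es_vectors = {
    "planta": Counter({"agua": 2, "luz": 1}),
    "motor": Counter({"rueda": 3}),
}

source_vec = context_vector(eu_tokens, "landare", window=1)
ranking = rank_candidates(source_vec, es_vectors, seed)
```

In this sketch the source word's context is translated through the seed dictionary before comparison, so the quality of the ranking depends on how much of the context the dictionary covers.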
