Bilingual Dictionary Extraction from Wikipedia

Two elements are essential to extracting a bilingual dictionary from comparable corpora: how the comparable corpora are mined and how the dictionary is extracted from them. This paper first proposes a method that uses the interlanguage links in Wikipedia to build comparable corpora. The large scale of Wikipedia ensures the quantity of the collected corpora, and because the interlanguage links are created by article authors, the quality of the collected corpora can also be guaranteed. The paper then presents an approach that combines context heterogeneity similarity and dependency heterogeneity similarity to extract a bilingual dictionary from the collected comparable corpora. Experimental results show that, by appropriately combining the advantages of the two similarity measures, the proposed approach outperforms each of the two individual approaches.
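The two steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy in-memory data layout, and in particular the linear interpolation used to combine the two similarity scores are assumptions made for illustration, since the abstract does not state how the scores are combined. The sketch also assumes that a higher score means a better translation candidate.

```python
from typing import Dict, List, Tuple


def build_comparable_corpus(
    src_articles: Dict[str, str],
    tgt_articles: Dict[str, str],
    interlanguage_links: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair article texts that are joined by an interlanguage link.

    src_articles / tgt_articles map article title -> article text.
    interlanguage_links maps a source title -> the linked target title.
    (Hypothetical layout; real links would be read from a Wikipedia dump.)
    """
    pairs = []
    for src_title, tgt_title in interlanguage_links.items():
        # Keep a pair only when both linked articles are actually available.
        if src_title in src_articles and tgt_title in tgt_articles:
            pairs.append((src_articles[src_title], tgt_articles[tgt_title]))
    return pairs


def combined_score(context_sim: float, dependency_sim: float, weight: float = 0.5) -> float:
    """Combine the two similarities by linear interpolation.

    ASSUMPTION: the interpolation and the default weight are illustrative
    only; the source does not specify the combination formula.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * context_sim + (1.0 - weight) * dependency_sim


def rank_candidates(
    candidates: Dict[str, Tuple[float, float]], weight: float = 0.5
) -> List[str]:
    """Rank target-language candidates for one source word, best first.

    candidates maps target word -> (context_sim, dependency_sim).
    """
    def score(word: str) -> float:
        context_sim, dependency_sim = candidates[word]
        return combined_score(context_sim, dependency_sim, weight)

    return sorted(candidates, key=score, reverse=True)
```

With `weight=1.0` the ranking reduces to context heterogeneity alone, and with `weight=0.0` to dependency heterogeneity alone, which makes it easy to compare the combined ranking against each individual one.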
