Measuring Comparability of Multilingual Corpora Extracted from Wikipedia

Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments
