Finding Missing Cross-Language Links in Wikipedia

Wikipedia is a public encyclopedia composed of millions of articles, written daily by volunteer authors from different regions of the world. Articles contain cross-language links, which connect corresponding articles in different languages. This feature is extremely useful for applications in automatic translation and multilingual information retrieval, as it allows the assembly of comparable corpora. Since not every pair of corresponding articles is linked, it is important to have a mechanism that creates such links automatically, and this need has motivated the development of techniques for identifying missing cross-language links. In this article, we present CLLFinder, an approach for finding missing cross-language links. The approach uses the links between categories and the transitivity of existing cross-language links, as well as textual features extracted from the articles. Experiments using one million articles from the English and Portuguese Wikipedias attest to the viability of CLLFinder. The results show that our approach achieves a recall of 96% and a precision of 98%, outperforming the baseline system even though it employs fewer and simpler features.
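
To illustrate the transitivity idea mentioned above, the following is a minimal sketch, not CLLFinder's actual implementation. It assumes existing cross-language links are available as pairs of `(language, title)` tuples and proposes a target-language candidate whenever a source article reaches it through one intermediate article. The function name `transitive_candidates` and the sample data are hypothetical.

```python
from collections import defaultdict


def transitive_candidates(links, source_lang, target_lang):
    """Propose missing source->target links via one intermediate article.

    `links` is an iterable of ((lang, title), (lang, title)) pairs that
    represent existing cross-language links (an assumed input format).
    Returns {source_title: {candidate_target_title, ...}}.
    """
    # Treat cross-language links as undirected edges.
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    candidates = defaultdict(set)
    # Iterate over a snapshot so lookups cannot alter the dict mid-loop.
    for article, linked in list(neighbours.items()):
        lang, title = article
        if lang != source_lang:
            continue
        # Skip articles that already link directly to the target language.
        if any(l == target_lang for l, _ in linked):
            continue
        # Follow each existing link one step further (the pivot article).
        for pivot in linked:
            for l, t in neighbours[pivot]:
                if l == target_lang:
                    candidates[title].add(t)
    return dict(candidates)


# Hypothetical sample data: "Moon" has no direct Portuguese link,
# but its German counterpart does.
sample_links = [
    (("en", "Moon"), ("de", "Mond")),
    (("de", "Mond"), ("pt", "Lua")),
    (("en", "Sun"), ("pt", "Sol")),
    (("en", "Sun"), ("de", "Sonne")),
    (("de", "Sonne"), ("pt", "Sol")),
]

found = transitive_candidates(sample_links, "en", "pt")
```

In this sketch the candidates are only proposals; a full system along the lines described in the abstract would presumably combine them with category links and textual features before accepting a link.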
