Untangling the Cross-Lingual Link Structure of Wikipedia

Wikipedia articles in different languages are connected by interwiki links that are increasingly being recognized as a valuable source of cross-lingual information. Unfortunately, large numbers of links are imprecise or simply wrong. In this paper, techniques to detect such problems are identified. We formalize their removal as an optimization task based on graph repair operations. We then present an algorithm with provable properties that uses linear programming and a region growing technique to tackle this challenge. This allows us to transform Wikipedia into a much more consistent multilingual register of the world's entities and concepts.

[1]  Antonio Toral,et al.  Applying Wikipedia's Multilingual Knowledge to Cross-Lingual Question Answering , 2007, NLDB.

[2]  Anthony K. H. Tung,et al.  Comparing Stars: On Approximating Graph Edit Distance , 2009, Proc. VLDB Endow..

[3]  Carina Silberer,et al.  Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration , 2008, LREC.

[4]  Djoerd Hiemstra,et al.  WikiTranslate: Query Translation for Cross-lingual Information Retrieval using only Wikipedia , 2008, CLEF.

[5]  Subhash Khot,et al.  On the power of unique 2-prover 1-round games , 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.

[6]  Rada Mihalcea,et al.  Wikify!: linking documents to encyclopedic knowledge , 2007, CIKM '07.

[7]  Yuval Rabani,et al.  ON THE HARDNESS OF APPROXIMATING MULTICUT AND SPARSEST-CUT , 2005, 20th Annual IEEE Conference on Computational Complexity (CCC'05).

[8]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[9]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Avrim Blum,et al.  Correlation Clustering , 2004, Machine Learning.

[11]  Venkatesan Guruswami,et al.  Clustering with qualitative information , 2005, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[12]  Michael Langberg,et al.  The multi-multiway cut problem , 2007, Theor. Comput. Sci..

[13]  Gosse Bouma,et al.  Cross-lingual Alignment and Completion of Wikipedia Templates , 2009 .

[14]  Narendra Karmarkar,et al.  A new polynomial-time algorithm for linear programming , 1984, STOC '84.

[15]  Philipp Cimiano,et al.  Enriching the crosslingual link structure of Wikipedia - A classification-based approach , 2008, AAAI 2008.

[16]  Frank Thomson Leighton,et al.  Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms , 1999, JACM.

[17]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[18]  Dan Roth,et al.  Learning better transliterations , 2009, CIKM.

[19]  Anthony Wirth,et al.  Correlation Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[20]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..

[21]  Mihalis Yannakakis,et al.  Approximate Max-Flow Min-(Multi)Cut Theorems and Their Applications , 1996, SIAM J. Comput..

[22]  Michael Skinner,et al.  Information arbitrage across multi-lingual Wikipedia , 2009, WSDM '09.

[23]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.