MENTA: Inducing Multilingual Taxonomies from Wikipedia

In recent years, a number of projects have turned to Wikipedia to establish large-scale taxonomies that describe orders of magnitude more entities than traditional manually built knowledge bases. So far, however, the multilingual nature of Wikipedia has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is presumably the largest multilingual lexical knowledge base currently available.
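The Markov chain-based ranking step can be illustrated with a small sketch. The abstract does not specify the chain, so everything below is an assumption made for illustration: the random-walk-with-restart formulation, the `restart` parameter, the function names, and the toy graph with its edge weights are hypothetical and are not taken from MENTA itself. The idea shown is only that candidate parents of an entity can be ranked by how much probability mass a random walk starting at that entity assigns to them.

```python
from typing import Dict, List, Tuple

# Hypothetical weighted graph of candidate taxonomic links:
# node -> list of (candidate parent, link weight). All values are illustrative.
Graph = Dict[str, List[Tuple[str, float]]]

example_edges: Graph = {
    "Fersental": [("valley", 0.9), ("Trentino", 0.1)],
    "valley": [("landform", 1.0)],
    "Trentino": [("province", 1.0)],
}


def rank_by_random_walk(edges: Graph, source: str,
                        restart: float = 0.15,
                        iterations: int = 100) -> Dict[str, float]:
    """Approximate the stationary distribution of a random walk with restart.

    The walk starts at `source`, follows outgoing links with probability
    proportional to their weight, and jumps back to `source` with
    probability `restart` (or always, when a node has no outgoing links).
    """
    nodes = set(edges)
    for targets in edges.values():
        for target, _ in targets:
            nodes.add(target)

    scores = {node: 0.0 for node in nodes}
    scores[source] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, mass in scores.items():
            outgoing = edges.get(node, [])
            total = sum(weight for _, weight in outgoing)
            if total == 0:
                # Dangling node: return all of its mass to the source.
                updated[source] += mass
                continue
            updated[source] += restart * mass
            for target, weight in outgoing:
                updated[target] += (1.0 - restart) * mass * weight / total
        scores = updated
    return scores


def best_parent(edges: Graph, source: str) -> str:
    """Pick the direct candidate parent of `source` with the highest score."""
    scores = rank_by_random_walk(edges, source)
    candidates = [target for target, _ in edges.get(source, [])]
    return max(candidates, key=lambda candidate: scores[candidate])
```

Calling `best_parent(example_edges, "Fersental")` selects whichever direct candidate receives more of the walk's probability mass. In a real setting the link weights would come from the linking heuristics, and the walk would run over the merged equivalence classes rather than over raw page titles.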
