Transforming Wikipedia into a large scale multilingual concept network

A knowledge base for real-world language processing applications should consist of a large base of facts and reasoning mechanisms that combine them to induce novel and more complex information. This paper describes an approach to deriving such a large-scale, multilingual resource by exploiting several facets of the on-line encyclopedia Wikipedia. We show how we can build upon Wikipedia's existing network of categories and articles to automatically discover new relations and their instances. Working on top of this network allows added information to influence the network and be propagated throughout it using inference mechanisms that connect different pieces of existing knowledge. We then exploit this gained information to discover new relations that refine some of those found in the previous step. The result is a network containing approximately 3.7 million concepts with lexicalizations in numerous languages and 49+ million relation instances. Intrinsic and extrinsic evaluations show that this is a high-quality resource that is beneficial to various NLP tasks.
