Wikitology: a novel hybrid knowledge base derived from Wikipedia

World knowledge may be available in different forms, such as relational databases, triple stores, link graphs, metadata and free text. Human minds are capable of understanding and reasoning over knowledge represented in different ways and are influenced by different social, contextual and environmental factors. Following a similar model, we have integrated a variety of knowledge sources in a novel way to produce a single hybrid knowledge base, Wikitology, which enables applications to better access and exploit knowledge hidden in different forms. Wikipedia is an invaluable resource for generating a hybrid knowledge base because its structured, semi-structured and unstructured encyclopedic information is both available and interlinked. However, Wikipedia is designed to facilitate human understanding and contribution: the interlinking of articles and categories supports browsing and search, which makes the content easy for humans to understand but requires intelligent approaches before applications can exploit it directly. Research projects like Cyc [61] have resulted in the development of a complex, broad-coverage knowledge base; however, relatively few applications have been built that really exploit it. In contrast, the design and development of the Wikitology KB has been incremental, driven and guided by a variety of applications and approaches that exploit the knowledge available in Wikipedia in different ways. This evolution has resulted in a hybrid knowledge base that incorporates and integrates not only a variety of knowledge resources but also a variety of data structures, and that exposes the knowledge hidden in different forms to applications through a single integrated query interface.
We demonstrate the value of the derived knowledge base by developing problem-specific intelligent approaches that exploit Wikitology for a diverse set of use cases: document concept prediction; cross-document co-reference resolution, defined as a task in Automatic Content Extraction (ACE) [1]; entity linking to KB entities, defined as part of the Text Analysis Conference Knowledge Base Population Track 2009 [65]; and interpreting tables [94]. These use cases directly evaluate the utility of the knowledge base for different applications and also demonstrate how the knowledge base can be exploited in different ways. Based on our work, we have also developed a Wikitology API that applications can use to exploit this unique hybrid knowledge resource for solving real-world problems. The use cases that exploit Wikitology also contribute to enriching the knowledge base automatically. The document concept prediction approach can predict inter-article links and category links for new Wikipedia articles. Cross-document co-reference resolution and entity linking provide a way to link entity mentions in Wikipedia articles or external articles to the corresponding entity articles in Wikipedia, and also help in suggesting redirects. In addition, we have developed specific approaches aimed at automatically enriching the Wikitology KB: unsupervised discovery of ontology elements using the inter-article links, generation of disambiguation trees for entities, and estimation of the page rank of Wikipedia concepts to serve as a measure of popularity. Combined, these approaches can contribute to a number of steps in a broader unified framework for automatically adding new concepts to the Wikitology knowledge base.
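To make the idea of page rank as a popularity measure concrete, the following is a minimal sketch of the standard PageRank power iteration over an inter-article link graph. It is not the Wikitology implementation: the function name `pagerank`, the damping factor of 0.85, the iteration count and the toy link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores for a small link graph.

    `links` maps each page title to the list of page titles it links to.
    All names and parameter values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution over pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Each page passes an equal share of its rank to every page it links to.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph of inter-article links.
toy_links = {
    "Baltimore": ["Maryland", "United States"],
    "Maryland": ["United States"],
    "United States": ["Maryland"],
    "UMBC": ["Baltimore", "Maryland"],
}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.4f}")
```

In this toy graph, articles that receive many incoming links score higher than an article with none, which is the sense in which the score can stand in for popularity.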

[1]  Ahmet Arslan,et al.  A comparison of Relational Databases and information retrieval libraries on Turkish text retrieval , 2008, 2008 International Conference on Natural Language Processing and Knowledge Engineering.

[2]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[3]  Maged M. Michael,et al.  Scalability of the Nutch search engine , 2007, ICS '07.

[4]  Timothy W. Finin,et al.  Creating and Exploiting a Web of Semantic Data , 2010, ICAART.

[5]  R. Bonato Network Analysis for Wikipedia , 2005 .

[6]  Daniel S. Weld,et al.  Automatically refining the Wikipedia infobox ontology , 2008, WWW.

[7]  Melvil Dewey,et al.  Abridged Dewey decimal classification and relative index , 1971 .

[8]  J. Carroll,et al.  Jena: implementing the semantic web recommendations , 2004, WWW Alt. '04.

[9]  V. Zlatic,et al.  Wikipedias: collaborative web-based encyclopedias as complex networks. , 2006, Physical review. E, Statistical, nonlinear, and soft matter physics.

[10]  P. Ingwersen,et al.  Proceedings of ISSI 2005 – The 10th International Conference of the International Society for Scientometrics and Informetrics: Stockholm, Sweden, July 24-28, 2005 , 2005 .

[11]  Edoardo M. Airoldi,et al.  Network Analysis of Wikipedia , 2008 .

[12]  Tim Berners-Lee,et al.  Linked data on the web (LDOW2008) , 2008, WWW.

[13]  Takahiro Hara,et al.  Wikipedia Mining for an Association Web Thesaurus Construction , 2007, WISE.

[14]  Gilad Mishne,et al.  Using Wikipedia at the TREC QA Track , 2004, TREC.

[15]  Nicholas Gibbins,et al.  3store: Efficient Bulk RDF Storage , 2003, PSSS.

[16]  Gerhard Weikum,et al.  YAGO: A Large Ontology from Wikipedia and WordNet , 2008, J. Web Semant..

[17]  David Maier,et al.  From databases to dataspaces: a new abstraction for information management , 2005, SGMD.

[18]  Péter Schönhofen Identifying document topics using the Wikipedia category network , 2009, Web Intell. Agent Syst..

[19]  Yuji Matsumoto,et al.  A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields , 2007, EMNLP.

[20]  Steffen Staab,et al.  Measuring Similarity between Ontologies , 2002, EKAW.

[21]  Gottfried Vossen,et al.  Web Information Systems Engineering (WISE) , 2009 .

[22]  Ee-Peng Lim,et al.  Measuring article quality in Wikipedia: models and evaluation , 2007, CIKM '07.

[23]  Simone Paolo Ponzetto,et al.  Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution , 2006, NAACL.

[24]  Asunción Gómez-Pérez,et al.  ONTOMETRIC: A Method to Choose the Appropriate Ontology , 2004, J. Database Manag..

[25]  Beng Chin Ooi,et al.  The Claremont report on database research , 2008, SGMD.

[26]  Fabio Crestani,et al.  Application of Spreading Activation Techniques in Information Retrieval , 1997, Artificial Intelligence Review.

[27]  Iryna Gurevych,et al.  Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary , 2008, LREC.

[28]  Ian H. Witten,et al.  Learning to link with Wikipedia , 2008, CIKM '08.

[29]  David Yarowsky,et al.  Structural, Transitive and Latent Models for Biographic Fact Extraction , 2009, EACL.

[30]  G. Caldarelli,et al.  Taxonomy and clustering in collaborative systems: The case of the on-line encyclopedia Wikipedia , 2007, 0710.3058.

[31]  T. Kalamboukis,et al.  Text Classification Using Clustering , 2006 .

[32]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[33]  Bayle Shanks WikiGateway: a library for interoperability and accelerated wiki development , 2005, Int. Sym. Wikis.

[34]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[35]  Yorick Wilks,et al.  Data Driven Ontology Evaluation , 2004, LREC.

[36]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[37]  Kristin P. Bennett,et al.  Support vector machines: hype or hallelujah? , 2000, SKDD.

[38]  Neal S. Coulter,et al.  Computing classification system 1998: Current status and future maintenance , 1998 .

[39]  Dirk-Willem van Gulik,et al.  Indexing and retrieving Semantic Web resources: the RDFStore model , 2003 .

[40]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[41]  Timothy W. Finin,et al.  Using Wikitology for Cross-Document Entity Coreference Resolution , 2009, AAAI Spring Symposium: Learning by Reading and Learning to Read.

[42]  Ramanathan V. Guha,et al.  Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project , 1990 .

[43]  Daniel S. Weld,et al.  Using Wikipedia to bootstrap open information extraction , 2009, SGMD.

[44]  Jerry R. Hobbs,et al.  Learning by Reading: A Prototype System, Performance Baseline and Lessons Learned , 2007, AAAI.

[45]  Andrew Krizhanovsky Synonym search in Wikipedia: Synarcher , 2006, ArXiv.

[46]  David Yarowsky,et al.  Cross-Document Coreference Resolution: A Key Technology for Learning by Reading , 2009, AAAI Spring Symposium: Learning by Reading and Learning to Read.

[47]  Evgeniy Gabrilovich,et al.  Concept-Based Feature Generation and Selection for Information Retrieval , 2008, AAAI.

[48]  Katy Börner,et al.  Analyzing and visualizing the semantic coverage of Wikipedia and its authors: Research Articles , 2007 .

[49]  Evgeniy Gabrilovich,et al.  Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge , 2006, AAAI.

[50]  Satoshi Sekine,et al.  Preemptive Information Extraction using Unrestricted Relation Discovery , 2006, NAACL.

[51]  D. Sánchez,et al.  Automatic information extraction from the Web , 2004 .

[52]  E. Prud'hommeaux,et al.  SPARQL query language for RDF , 2011 .

[53]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[54]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[55]  Aaron Vegh MySQL Database Server , 2011 .

[56]  Naoaki Okazaki,et al.  Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web , 2009, ACL.

[57]  Sergei Nirenburg,et al.  Evaluating the performance of the OntoSem semantic analyzer , 2004 .

[58]  Rada Mihalcea,et al.  Using Wikipedia for Automatic Word Sense Disambiguation , 2007, NAACL.

[59]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[60]  Rada Mihalcea,et al.  Wikify!: linking documents to encyclopedic knowledge , 2007, CIKM '07.

[61]  Martin Hepp,et al.  Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management , 2007, IEEE Internet Computing.

[62]  Simone Paolo Ponzetto,et al.  WikiTaxonomy: A Large Scale Knowledge Resource , 2008, ECAI.

[63]  Simone Paolo Ponzetto,et al.  WikiRelate! Computing Semantic Relatedness Using Wikipedia , 2006, AAAI.

[64]  Martin Hepp,et al.  Harvesting Wiki Consensus - Using Wikipedia Entries as Ontology Elements , 2006, SemWiki.

[65]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[66]  Antonio Toral,et al.  A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia , 2006, Workshop On New Text Wikis And Blogs And Other Dynamic Text Sources.

[67]  Katy Börner,et al.  Analyzing and visualizing the semantic coverage of Wikipedia and its authors , 2005, Complex..

[68]  Mitsuru Ishizuka,et al.  Subtree Mining for Relation Extraction from Wikipedia , 2007, NAACL.

[69]  J. Giles Internet encyclopaedias go head to head , 2005, Nature.

[70]  Markus Krötzsch,et al.  Semantic MediaWiki , 2006, Foundations for the Web of Information and Services.

[71]  Tomáš Kliegr,et al.  Unsupervised Entity Classification with Wikipedia and WordNet , 2007 .

[72]  David F. Wood,et al.  Kowari: A Platform for Semantic Web Storage and Analysis , 2005, WWW 2005.

[73]  Gang Wang,et al.  Enhancing Relation Extraction by Eliciting Selectional Constraint Features from Wikipedia , 2007, NLDB.

[74]  Tim Finin,et al.  Exploiting a Web of Semantic Data for Interpreting Tables , 2010 .

[75]  James A. Thom,et al.  Entity ranking in Wikipedia , 2007, SAC '08.

[76]  Andreas Harth,et al.  Optimized index structures for querying RDF from the Web , 2005, Third Latin American Web Congress (LA-WEB'2005).

[77]  Li Ding,et al.  How the Semantic Web is Being Used: An Analysis of FOAF Documents , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[78]  James A. Thom,et al.  Ontology evaluation using Wikipedia categories for browsing , 2007, CIKM '07.

[79]  Razvan C. Bunescu,et al.  Learning for information extraction: from named entity recognition and disambiguation to relation extraction , 2007 .

[80]  Kentaro Torisawa,et al.  Exploiting Wikipedia as External Knowledge for Named Entity Recognition , 2007, EMNLP.

[81]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[82]  Ralph Grishman,et al.  Discovering Relations among Named Entities from Large Corpora , 2004, ACL.

[83]  Leo Sauermann,et al.  The Sesame LuceneSail : RDF Queries with Full-text Search NEPOMUK Technical Report 2008-1 , 2008 .

[84]  Timothy W. Finin,et al.  Wikipedia as an Ontology for Describing Documents , 2008, ICWSM.

[85]  Heng Ji,et al.  Overview of the TAC 2010 Knowledge Base Population Track , 2010 .

[86]  Wolfgang Nejdl,et al.  Extracting Semantics Relationships between Wikipedia Categories , 2006, SemWiki.

[87]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.