Knowledge extraction from Chinese wiki encyclopedias

The vision of the Semantic Web is to build a ‘Web of data’ that enables machines to understand the semantics of information on the Web. The Linked Open Data (LOD) project encourages people and organizations to publish open data sets on the Web in the Resource Description Framework (RDF) format, which promotes the development of the Semantic Web. Among the LOD datasets, DBpedia has proved to be a successful structured knowledge base and has become the central interlinking hub of the Web of data in English. In Chinese, however, little linked data has been published and linked to DBpedia, which hinders the sharing of structured knowledge across both Chinese and cross-lingual resources. This paper presents an approach for building a large-scale Chinese structured knowledge base from Chinese wiki resources, including Hudong and Baidu Baike. The proposed approach first builds an ontology based on the wiki category system and infoboxes, and then extracts instances from wiki articles. With Hudong as the source, our approach builds an ontology containing 19 542 concepts and 2381 properties. A total of 802 593 instances are extracted and described using the concepts and properties of the extracted ontology, and 62 679 of them are linked to equivalent instances in DBpedia. From Baidu Baike, our approach builds an ontology containing 299 concepts, 37 object properties, and 5590 data type properties. A total of 1 319 703 instances are extracted from Baidu Baike, and 84 343 of them are linked to instances in DBpedia. We provide RDF dumps and a SPARQL endpoint for accessing the resulting Chinese knowledge bases. Knowledge bases built with our approach can be used not only for building Chinese linked data, but also in applications of large-scale knowledge bases such as question answering and semantic search.
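As a rough illustration of the instance-extraction step described above, the following Python sketch turns one wiki article (title, categories, infobox) into RDF statements in N-Triples syntax. It is not the authors' pipeline: the `example.org` namespace, the function names, and the sample article are assumptions made only for this example, and every infobox value is treated as a plain literal, whereas the paper distinguishes object properties from data type properties.

```python
from urllib.parse import quote

# Hypothetical namespace, chosen only for this illustration.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def iri(kind, name):
    """Build a percent-encoded IRI in the hypothetical namespace."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def article_to_ntriples(title, categories, infobox, dbpedia_iri=None):
    """Describe one article as an instance of its categories with infobox properties."""
    subject = iri("resource", title)
    lines = []
    # Each wiki category becomes a concept the instance belongs to.
    for category in categories:
        lines.append(f"{subject} <{RDF_TYPE}> {iri('ontology', category)} .")
    # Each infobox key becomes a property; values are kept as plain literals.
    for key, value in infobox.items():
        lines.append(f'{subject} {iri("property", key)} "{escape_literal(value)}" .')
    # Optional link to an equivalent DBpedia instance.
    if dbpedia_iri:
        lines.append(f"{subject} <{OWL_SAME_AS}> <{dbpedia_iri}> .")
    return lines


if __name__ == "__main__":
    # Made-up sample article, used only to exercise the function.
    for line in article_to_ntriples(
        "北京",
        ["城市"],
        {"别名": "北平"},
        dbpedia_iri="http://dbpedia.org/resource/Beijing",
    ):
        print(line)
```

Percent-encoding the Chinese titles keeps the generated identifiers within ASCII; a real dataset might instead emit internationalized IRIs directly, which is a design choice this sketch does not take a position on.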
