The YAGO-NAGA approach to knowledge discovery

This paper gives an overview on the YAGO-NAGA approach to information extraction for building a conveniently searchable, large-scale, highly accurate knowledge base of common facts. YAGO harvests infoboxes and category names of Wikipedia for facts about individual entities, and it reconciles these with the taxonomic backbone of WordNet in order to ensure that all entities have proper classes and the class system is consistent. Currently, the YAGO knowledge base contains about 19 million instances of binary relations for about 1.95 million entities. Based on intensive sampling, its accuracy is estimated to be above 95 percent. The paper presents the architecture of the YAGO extractor toolkit, its distinctive approach to consistency checking, its provisions for maintenance and further growth, and the query engine for YAGO, coined NAGA. It also discusses ongoing work on extensions towards integrating fact candidates extracted from natural-language text sources.

[1]  Douglas E. Appelt,et al.  Introduction to Information Extraction , 1999, AI Commun..

[2]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[3]  W. Bruce Croft,et al.  Statistical language modeling for information retrieval , 2006, Annu. Rev. Inf. Sci. Technol..

[4]  Eugene Agichtein Scaling Information Extraction to Large Document Collections , 2005, IEEE Data Eng. Bull..

[5]  Gerhard Weikum,et al.  Combining linguistic and statistical analysis to extract relations from web documents , 2006, KDD '06.

[6]  Wei-Ying Ma,et al.  Simultaneous record detection and attribute labeling in web data extraction , 2006, KDD '06.

[7]  John D. Lafferty,et al.  A risk minimization framework for information retrieval , 2006, Inf. Process. Manag..

[8]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[9]  Jeffrey F. Naughton,et al.  Declarative Information Extraction Using Datalog with Embedded Extraction Predicates , 2007, VLDB.

[10]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[11]  Wei-Ying Ma,et al.  Web object retrieval , 2007, WWW '07.

[12]  Daniel S. Weld,et al.  Autonomously semantifying wikipedia , 2007, CIKM '07.

[13]  Oren Etzioni,et al.  Structured Querying of Web Text Data: A Technical Challenge , 2007, CIDR.

[14]  Raghu Ramakrishnan,et al.  Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach , 2007, VLDB.

[15]  Luis Gravano,et al.  Towards a query optimizer for text-centric tasks , 2007, TODS.

[16]  Gerhard Weikum,et al.  YAGO: A Large Ontology from Wikipedia and WordNet , 2008, J. Web Semant..

[17]  Daniel S. Weld,et al.  Automatically refining the wikipedia infobox ontology , 2008, WWW.

[18]  Gerhard Weikum,et al.  NAGA: Searching and Ranking Knowledge , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[19]  Sunita Sarawagi,et al.  Information Extraction , 2008 .

[20]  Oren Etzioni,et al.  Open Information Extraction from the Web , 2007, CACM.

[21]  Gerhard Weikum,et al.  TOB: Timely Ontologies for Business Relations , 2008, WebDB.

[22]  Gerhard Weikum,et al.  Personalizing the Search for Knowledge , 2008, PersDB.

[23]  Frederick Reiss,et al.  An Algebraic Approach to Rule-Based Information Extraction , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[24]  Gerhard Weikum,et al.  STAR: Steiner-Tree Approximation in Relationship Graphs , 2009, 2009 IEEE 25th International Conference on Data Engineering.