Representativeness of Knowledge Bases with the Generalized Benford's Law

Knowledge bases (KBs) such as DBpedia, Wikidata, and YAGO contain a huge number of entities and facts. Several recent works induce rules or calculate statistics on these KBs. Most of these methods assume that the data is a representative sample of the studied universe. Unfortunately, KBs are biased because they are built through crowdsourcing and the opportunistic agglomeration of available databases. This paper aims to approximate the representativeness of a relation within a knowledge base. For this, we use the generalized Benford’s law, which indicates the distribution that the facts of a relation are expected to follow. We then compute the minimum number of facts that have to be added in order to make the KB representative of the real world. Experiments show that our unsupervised method applies to a large number of relations. For numerical relations where ground truths exist, the estimated representativeness proves to be a reliable indicator.
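To make the idea concrete, the following is a minimal sketch, not the paper's actual algorithm. It assumes the commonly used form of the generalized Benford's law, in which the probability of first digit d is proportional to (d+1)^(1-α) - d^(1-α), reducing to log10(1 + 1/d) when α = 1. The function names, the exponent parameter `alpha`, and the simple "smallest superset that matches the expected distribution" completion rule are illustrative assumptions.

```python
import math


def generalized_benford(alpha, base=10):
    """First-digit probabilities under the generalized Benford's law.

    Assumed form: P(d) ~ (d+1)^(1-alpha) - d^(1-alpha); alpha = 1 gives
    the classical law P(d) = log_base(1 + 1/d).
    """
    digits = range(1, base)
    if math.isclose(alpha, 1.0):
        return [math.log(1 + 1 / d, base) for d in digits]
    e = 1.0 - alpha
    norm = base ** e - 1  # telescoping sum of the numerators
    return [((d + 1) ** e - d ** e) / norm for d in digits]


def min_facts_to_add(counts, alpha=1.0):
    """Hypothetical completion rule (an assumption of this sketch).

    counts[i] is the observed number of facts whose value starts with
    digit i+1. Facts can only be added, so we look for the smallest
    total N with N * P(d) >= counts[d] for every digit d.
    Returns (missing facts, estimated representativeness in [0, 1]).
    """
    probs = generalized_benford(alpha)
    total = sum(counts)
    if total == 0:
        return 0, 1.0
    target = max(c / p for c, p in zip(counts, probs))
    # Small tolerance so float noise does not push ceil() up by one.
    missing = max(0, math.ceil(target - 1e-9) - total)
    return missing, total / (total + missing)
```

For example, a relation whose first digits are uniformly distributed (`min_facts_to_add([100] * 9)`) is far from the classical Benford distribution, so the sketch reports a large number of missing facts and a representativeness well below 1.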
