Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs - The Case of Wikidata

Collaborative Knowledge Graph platforms allow humans and automated scripts to collaborate in creating, updating and interlinking entities and facts. To ensure both the completeness of the data and a uniform coverage of the different topics, it is crucial to identify underrepresented classes in the Knowledge Graph. In this paper, we tackle this problem by developing statistical techniques for class cardinality estimation in collaborative Knowledge Graph platforms. Our method estimates the completeness of a class - as defined by a schema or ontology - and can hence be used to answer questions such as "Does the knowledge base have a complete list of all {Beer Brands|Volcanos|Video Game Consoles}?" As a use case, we focus on Wikidata, which poses unique challenges in terms of the size of its ontology, the number of users actively populating its graph, and its extremely dynamic nature. Our techniques are derived from species estimation and data-management methodologies, and are adapted to the setting of graphs and collaborative editing. In our empirical evaluation, we observe that i) the number and frequency of unique class instances drastically influence the performance of an estimator, ii) bursts of inserts cause some estimators to overestimate the true size of the class if they are not properly handled, and iii) one can effectively measure the convergence of a class towards its true size by considering the stability of an estimator against the number of available instances.
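
To make the idea of species-style class cardinality estimation concrete, the following is a minimal sketch using the bias-corrected Chao1 estimator, a standard non-parametric estimator from the species-richness literature. The abstract does not say which estimators the paper evaluates, so the choice of Chao1, the function names `chao1_estimate` and `completeness_ratio`, and the assumption that each observation is one edit mentioning an instance of the class are all illustrative rather than the authors' method.

```python
from collections import Counter


def chao1_estimate(observations):
    """Bias-corrected Chao1 estimate of the number of distinct instances.

    `observations` is an iterable of instance identifiers, one entry per
    observation (assumed here: one per edit mentioning the instance).
    """
    counts = Counter(observations)            # observations per instance
    freq_of_freq = Counter(counts.values())   # f_k: instances seen exactly k times
    s_obs = len(counts)                       # distinct instances observed
    f1 = freq_of_freq.get(1, 0)               # singletons
    f2 = freq_of_freq.get(2, 0)               # doubletons
    # Many singletons relative to doubletons suggest many unseen instances.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def completeness_ratio(observations):
    """Observed distinct instances divided by the estimated class size."""
    observations = list(observations)
    estimate = chao1_estimate(observations)
    if estimate == 0:
        return 1.0  # nothing observed, nothing estimated
    return len(set(observations)) / estimate


# Hypothetical instance identifiers for one class.
edits = ["Q1", "Q1", "Q2", "Q3", "Q3", "Q4"]
print(chao1_estimate(edits))
print(completeness_ratio(edits))
```

A ratio close to 1 suggests that few instances remain unseen. Under this simple model, a burst of inserts adds many singletons at once and inflates the estimate, which is in line with observation ii) above.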
