A Study of the Quality of Wikidata

Wikidata has been increasingly adopted by many communities for a wide variety of applications, which demand high-quality knowledge to deliver successful results. In this paper, we develop a framework to detect and analyze low-quality statements in Wikidata by shedding light on the current practices exercised by the community. We explore three indicators of data quality in Wikidata, based on: 1) community consensus on the currently recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality; 2) statements that have been deprecated; and 3) constraint violations in the data. We combine these indicators to detect low-quality statements, revealing challenges with duplicate entities, missing triples, violated type rules, and taxonomic distinctions. Our findings complement ongoing efforts by the Wikidata community to improve data quality, aiming to make it easier for users and editors to find and correct mistakes.
