Knowledge Graph Completeness: A Systematic Literature Review

The quality of a Knowledge Graph (also known as Linked Data) is an important indicator of its fitness for use in an application. Several quality dimensions have been identified for assessing it, such as accuracy, completeness, timeliness, provenance, and accessibility. While many prior studies offer a landscape view of data quality dimensions, here we present a systematic literature review focused on assessing the completeness of Knowledge Graphs. We gather existing approaches from the literature and analyze them qualitatively and quantitatively. In particular, we unify and formalize commonly used terminology across 56 articles related to the completeness dimension of data quality and provide a comprehensive list of methodologies and metrics used to evaluate the different types of completeness. We identify seven types of completeness, including three that were not identified in previous surveys. We also analyze nine tools capable of assessing Knowledge Graph completeness. The aim of this Systematic Literature Review is to provide researchers and data curators with a comprehensive and deeper understanding of existing work on completeness and its properties, thereby encouraging further experimentation and the development of new approaches focused on completeness as a data quality dimension of Knowledge Graphs.
