HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources

The rapid growth of linked data has resulted in many weakly structured and incomplete data sources, where type information may be missing. Yet type information is essential for tasks such as query answering, integration, summarization, and partitioning. Existing approaches to type discovery either completely ignore the type declarations available in the dataset (implicit type discovery approaches) or rely only on existing types in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between instances; this process is expensive and not incremental. Explicit type enrichment approaches, on the other hand, cannot identify new types and cannot process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, which enables type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the instances, index them, and then group them to identify the types. While processing an instance, our approach exploits its type information, if available, to improve the quality of the discovered types: it guides the classification of the new instance into the correct group and refines the groups already built. We show analytically and experimentally that our approach is more efficient than competitors from both families, implicit type discovery and explicit type enrichment, while outperforming them in quality in most cases.
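
To make the idea of incremental, hybrid grouping concrete, the following is a minimal sketch and not the HInT algorithm itself. It assumes that an instance's pattern is the set of its properties, that patterns are compared with Jaccard similarity against a fixed threshold, and that a declared `rdf:type` takes priority over structural similarity. The names `IncrementalTypeDiscovery`, `Group`, and `add_instance`, as well as the threshold value, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field

RDF_TYPE = "rdf:type"  # abbreviated IRI, used here for readability


def jaccard(a, b):
    """Jaccard similarity of two property sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class Group:
    """A candidate type: union of member properties plus any declared types."""
    properties: set = field(default_factory=set)
    declared_types: set = field(default_factory=set)
    members: list = field(default_factory=list)


class IncrementalTypeDiscovery:
    """Hypothetical sketch: instances arrive one by one and are never re-compared."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold   # assumed similarity cut-off
        self.groups = []             # discovered candidate types
        self.pattern_index = {}      # pattern -> group, avoids repeated comparisons

    def add_instance(self, subject, triples):
        # Split the instance's triples into its pattern and its declared types.
        pattern = frozenset(p for _, p, _ in triples if p != RDF_TYPE)
        declared = {o for _, p, o in triples if p == RDF_TYPE}

        group = self._find_group(pattern, declared)
        if group is None:            # nothing close enough: open a new group
            group = Group()
            self.groups.append(group)

        # Refine the chosen group with what this instance contributes.
        group.properties |= pattern
        group.declared_types |= declared
        group.members.append(subject)
        self.pattern_index[pattern] = group
        return group

    def _find_group(self, pattern, declared):
        # 1. Hybrid step: a declared type guides the instance to a matching group.
        if declared:
            for g in self.groups:
                if g.declared_types & declared:
                    return g
        # 2. Index lookup: a pattern seen before maps directly to its group.
        if pattern in self.pattern_index:
            return self.pattern_index[pattern]
        # 3. Fallback: most similar group at or above the threshold.
        best, best_sim = None, self.threshold
        for g in self.groups:
            sim = jaccard(pattern, g.properties)
            if sim >= best_sim:
                best, best_sim = g, sim
        return best


# Usage with made-up instances (ex: is a placeholder namespace).
hint = IncrementalTypeDiscovery(threshold=0.5)
hint.add_instance("a1", [("a1", "ex:name", "Ann"), ("a1", "ex:email", "a@x"),
                         ("a1", RDF_TYPE, "ex:Person")])
hint.add_instance("a2", [("a2", "ex:name", "Bob"), ("a2", "ex:email", "b@x")])
hint.add_instance("b1", [("b1", "ex:title", "T"), ("b1", "ex:isbn", "1")])
hint.add_instance("c1", [("c1", "ex:title", "U"), ("c1", RDF_TYPE, "ex:Book")])
for g in hint.groups:
    print(sorted(g.declared_types), g.members)
```

In this sketch the untyped instance `a2` is placed through the pattern index without any similarity computation, and the typed instance `c1` both joins and labels the group that `b1` opened. A real system would also need to split or merge groups as evidence accumulates, which is omitted here.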
