A Principled Approach to Bridging the Gap between Graph Data and their Schemas

Although RDF graph data often come with an associated schema, recent studies have shown that real RDF data rarely conform to their perceived schemas. Since a number of data management decisions, including storage layouts, indexing, and efficient query processing, rely on schemas, it is imperative to have an accurate description of the structuredness of the data at hand, that is, of how well the data conform to the schema. In this paper, we approach the study of the structuredness of an RDF graph in a principled way: we propose a framework for specifying structuredness functions, which gauge the degree to which an RDF graph conforms to a schema. In particular, we first define a formal language for specifying structuredness functions with expressions we call rules. This language allows a user to state a rule to which an RDF graph may fully or partially conform. We then consider the problem of discovering a refinement of a sort (type) by partitioning the dataset into subsets whose structuredness is above a specified threshold. We prove that the natural decision problem associated with this refinement problem is NP-complete, and we provide a natural translation of this problem into Integer Linear Programming (ILP). Finally, we test this ILP solution with three real-world datasets and three different and intuitive rules, which gauge structuredness in different ways. The rules give meaningful refinements of the datasets, which shows that our language can be a powerful tool for understanding the structure of RDF data, and the ILP solution is practical for a large fraction of existing data.
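To make the notion of a structuredness function concrete, the following minimal sketch computes one plausible measure, a coverage-style score: the fraction of (subject, property) pairs that are actually present among all pairs that could be present. This is an illustration under our own assumptions, not the rule language defined in the paper; the function name `coverage`, the triple representation, and the sample data are all hypothetical.

```python
from collections import defaultdict


def coverage(triples):
    """Illustrative structuredness score in [0, 1] (hypothetical helper).

    Treats the dataset as a subject-by-property table and returns the
    fraction of cells that are filled. An empty dataset scores 1.0.
    """
    props_by_subject = defaultdict(set)
    for subject, prop, _value in triples:
        props_by_subject[subject].add(prop)

    # Every property used by at least one subject defines a table column.
    all_props = set()
    for props in props_by_subject.values():
        all_props |= props

    total_cells = len(props_by_subject) * len(all_props)
    if total_cells == 0:
        return 1.0
    filled_cells = sum(len(props) for props in props_by_subject.values())
    return filled_cells / total_cells


# Hypothetical sample data: "bob" lacks the "email" property.
triples = [
    ("alice", "name", "Alice"),
    ("alice", "email", "alice@example.org"),
    ("bob", "name", "Bob"),
]
score = coverage(triples)
alice_only = coverage([t for t in triples if t[0] == "alice"])
```

Here the full dataset fills 3 of 4 cells, while the subset containing only `alice` fills every cell. Partitioning a sort into subsets that each score above a chosen threshold is the intuition behind the refinement problem described above.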
