ROCKER: A Refinement Operator for Key Discovery

The Linked Data principles provide a decentralized approach for publishing structured data in the RDF format on the Web. In contrast to relational databases, where a key is often provided explicitly, finding a set of properties that identifies a resource uniquely in RDF data is a non-trivial task. Still, finding keys is of central importance for many applications, such as resource deduplication, link discovery, logical data compression and data integration. In this paper, we address this research gap by specifying a refinement operator, dubbed ROCKER, which we prove to be finite, proper and non-redundant. We combine the theoretical characteristics of this operator with two monotonicities of keys to obtain a time-efficient approach for detecting keys, i.e., sets of properties that describe resources uniquely. We then use a hash index to compute the discriminability score efficiently. This ensures that our approach can scale to very large knowledge bases. Our evaluation shows that ROCKER yields more accurate results, has a comparable runtime, and consumes less memory than existing state-of-the-art techniques.
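To make the notions of key and discriminability score concrete, the following is a minimal Python sketch and not the ROCKER implementation. It rests on several assumptions that the abstract does not state: the score of a property set is taken to be the fraction of resources whose combined values are unique, a key is a property set with score 1.0, and the only pruning rule used is that every superset of a key is also a key. The function names `discriminability` and `minimal_keys`, the toy data, and the plain level-wise enumeration (standing in for the refinement operator) are all hypothetical.

```python
from itertools import combinations


def discriminability(resources, props):
    """Assumed score: fraction of resources whose values over `props` are unique.

    A dict serves as the hash index: each resource is bucketed by the tuple
    of its (multi-valued) property values, so one pass over the data suffices.
    """
    if not resources:
        return 0.0
    index = {}
    for name, description in resources.items():
        signature = tuple(frozenset(description.get(p, ())) for p in props)
        index.setdefault(signature, []).append(name)
    unique = sum(1 for bucket in index.values() if len(bucket) == 1)
    return unique / len(resources)


def minimal_keys(resources, properties):
    """Enumerate property sets by increasing size and keep the minimal keys.

    Pruning assumption: any superset of a known key is also a key,
    so such candidates are skipped without touching the data.
    """
    keys = []
    for size in range(1, len(properties) + 1):
        for candidate in combinations(properties, size):
            if any(set(key) <= set(candidate) for key in keys):
                continue  # superset of a known key, hence not minimal
            if discriminability(resources, candidate) == 1.0:
                keys.append(candidate)
    return keys


# Hypothetical toy data: resource -> property -> set of values.
resources = {
    "r1": {"name": {"Alice"}, "birthYear": {1980}, "city": {"Paris"}},
    "r2": {"name": {"Alice"}, "birthYear": {1990}, "city": {"Berlin"}},
    "r3": {"name": {"Bob"}, "birthYear": {1980}, "city": {"Berlin"}},
}
properties = ["name", "birthYear", "city"]

found_keys = minimal_keys(resources, properties)
```

In this toy dataset no single property separates all three resources, while each pair of properties does, so the three-property candidate is pruned without being scored.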
