SAKey: Scalable Almost Key Discovery in RDF Data

Exploiting identity links among RDF resources allows applications to integrate data efficiently. Keys are very useful for discovering these links: a set of properties is a key when its values uniquely identify resources. However, keys are usually not available. Existing approaches that discover keys automatically can easily be overwhelmed by the size of the data, and they require clean data. We present SAKey, an approach that discovers keys in RDF data efficiently. To prune the search space, SAKey exploits characteristics of the data that are detected dynamically during the process. Furthermore, it can discover keys in datasets that contain erroneous data or duplicates; such keys are called almost keys. The approach has been evaluated on several synthetic and real datasets. The results show both the relevance of almost keys and the efficiency of discovering them.
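As an illustration only, the notion of a key that tolerates a few violations can be sketched in a few lines of Python. This is a naive pairwise check, not the SAKey algorithm, and it does no pruning. It rests on assumptions that the abstract does not state: each resource is a dictionary from property names to sets of values, two resources collide on a property when they share at least one value for it, and a property set counts as an almost key when at most `n` resources are involved in a collision. The names `exceptions`, `is_almost_key`, and the `people` data are hypothetical.

```python
from itertools import combinations


def exceptions(resources, props):
    """Return the resources that collide with another resource on every property in props.

    Assumption: two resources collide on a property when their value sets
    for that property have at least one value in common.
    """
    found = set()
    for (id_a, desc_a), (id_b, desc_b) in combinations(resources.items(), 2):
        if all(desc_a.get(p, set()) & desc_b.get(p, set()) for p in props):
            found.update((id_a, id_b))
    return found


def is_almost_key(resources, props, n):
    """Assumed reading: props is an almost key if at most n resources are exceptions."""
    return len(exceptions(resources, props)) <= n


# Hypothetical toy data: p3 and p4 are duplicates of each other.
people = {
    "p1": {"name": {"Ann"}, "city": {"Paris"}},
    "p2": {"name": {"Ann"}, "city": {"Lyon"}},
    "p3": {"name": {"Bob"}, "city": {"Paris"}},
    "p4": {"name": {"Bob"}, "city": {"Paris"}},
}

print(exceptions(people, ["name", "city"]))        # the two duplicate resources
print(is_almost_key(people, ["name", "city"], 0))  # strict key check
print(is_almost_key(people, ["name", "city"], 2))  # tolerating two exceptions
```

In this toy data, `{name, city}` fails as a strict key only because of the duplicated pair, which is the situation where tolerating a small number of exceptions becomes useful. The pairwise loop is quadratic in the number of resources, so it only conveys the definition and says nothing about how to make the search scale.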
