Link Discovery with Guaranteed Reduction Ratio in Affine Spaces with Minkowski Measures

Time-efficient algorithms are essential for the complex linking tasks that arise when discovering links on the Web of Data. Although several lossless approaches have been developed for this purpose, they do not offer theoretical guarantees with respect to their performance. In this paper, we address this drawback by presenting the first Link Discovery approach with theoretical quality guarantees. In particular, we prove that, given an achievable reduction ratio $r$, our Link Discovery approach $\mathcal{HR}^3$ can achieve a reduction ratio $r' \leq r$ in a metric space where distances are measured by means of a Minkowski metric of any order $p \geq 2$. We compare $\mathcal{HR}^3$ and the HYPPO algorithm implemented in LIMES 0.5 with respect to the number of comparisons they carry out. In addition, we compare our approach with the algorithms implemented in the state-of-the-art frameworks LIMES 0.5 and SILK 2.5 with respect to runtime. We show that $\mathcal{HR}^3$ outperforms these previous approaches with respect to runtime in each of our four experimental setups.
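The two quantities the abstract relies on, the Minkowski distance of order $p$ and the reduction ratio, can be illustrated with a minimal Python sketch. This is not the $\mathcal{HR}^3$ algorithm: it only shows the brute-force baseline that any blocking approach tries to improve on. The function names are hypothetical, and the reduction ratio is assumed to follow the definition common in the record linkage literature, $1 - N / (|S| \cdot |T|)$, where $N$ is the number of comparisons carried out; the paper's own definition may differ.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def minkowski_distance(x: Point, y: Point, p: float = 2.0) -> float:
    """Minkowski distance of order p between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def reduction_ratio(comparisons: int, source_size: int, target_size: int) -> float:
    """Assumed definition: share of the |S| x |T| comparisons that was avoided."""
    return 1.0 - comparisons / (source_size * target_size)


def brute_force_links(
    source: List[Point], target: List[Point], threshold: float, p: float = 2.0
) -> Tuple[List[Tuple[Point, Point]], int]:
    """Compare every source point with every target point (no blocking).

    Returns the pairs within the distance threshold and the number of
    comparisons carried out, which is always |S| * |T| here.
    """
    links = []
    comparisons = 0
    for s in source:
        for t in target:
            comparisons += 1
            if minkowski_distance(s, t, p) <= threshold:
                links.append((s, t))
    return links, comparisons


if __name__ == "__main__":
    source = [(0.0, 0.0), (10.0, 10.0)]
    target = [(3.0, 4.0), (10.0, 11.0)]
    links, n = brute_force_links(source, target, threshold=5.0)
    print(links, n, reduction_ratio(n, len(source), len(target)))
```

Under this assumed definition the brute-force baseline has a reduction ratio of 0, since it carries out all $|S| \cdot |T|$ comparisons. A lossless blocking approach aims to skip comparisons without missing any pair within the threshold.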
