Active Learning of Domain-Specific Distances for Link Discovery

Discovering cross-knowledge-base links is of central importance for manifold tasks across the Linked Data Web. So far, learning link specifications has been addressed by approaches that rely on standard similarity and distance measures such as the Levenshtein distance for strings and the Euclidean distance for numeric values. While these approaches have been shown to perform well, the use of standard similarity measure still hampers their accuracy, as several link discovery tasks can only be solved sub-optimally when relying on standard measures. In this paper, we address this drawback by presenting a novel approach to learning string similarity measures concurrently across multiple dimensions directly from labeled data. Our approach is based on learning linear classifiers which rely on learned edit distance within an active learning setting. By using this combination of paradigms, we can ensure that we reduce the labeling burden on the experts at hand while achieving superior results on datasets for which edit distances are useful. We evaluate our approach on three different real datasets and show that our approach can improve the accuracy of classifiers. We also discuss how our approach can be extended to other similarity and distance measures as well as different classifiers.

[1]  Axel-Cyrille Ngonga Ngomo,et al.  A time-efficient hybrid approach to link discovery , 2011, OM.

[2]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[3]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[4]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[5]  Maria-Florina Balcan,et al.  Improved Guarantees for Learning via Similarity Functions , 2008, COLT.

[6]  Lora Aroyo,et al.  The Semantic Web: Research and Applications , 2009, Lecture Notes in Computer Science.

[7]  Mark B. Sandler,et al.  Automatic Interlinking of Music Datasets on the Semantic Web , 2008, LDOW.

[8]  Robert Isele,et al.  Efficient Multidimensional Blocking for Link Discovery without losing Recall , 2011, WebDB.

[9]  Sören Auer,et al.  LIMES - A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data , 2011, IJCAI.

[10]  Jens Lehmann,et al.  Introduction to Linked Data and Its Lifecycle on the Web , 2013, Reasoning Web.

[11]  Xuemin Lin,et al.  Ed-Join: an efficient algorithm for similarity joins with edit distance constraints , 2008, Proc. VLDB Endow..

[12]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[13]  Robert E. Tarjan,et al.  Fibonacci heaps and their uses in improved network optimization algorithms , 1987, JACM.

[14]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[15]  Jens Lehmann,et al.  RAVEN - active learning of link specifications , 2011, OM.

[16]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  Peter A. Flach,et al.  Evaluation Measures for Multi-class Subgroup Discovery , 2009, ECML/PKDD.

[18]  Axel-Cyrille Ngonga Ngomo,et al.  EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming , 2012, ESWC.

[19]  Guoliang Li,et al.  PASS-JOIN: A Partition-based Method for Similarity Joins , 2011, Proc. VLDB Endow..

[20]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[21]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[22]  Marc Sebban,et al.  Learning Good Edit Similarities with Generalization Guarantees , 2011, ECML/PKDD.

[23]  Robert Isele,et al.  Learning linkage rules using genetic programming , 2011, OM.

[24]  Jérôme Euzenat,et al.  Ontology Matching: State of the Art and Future Challenges , 2013, IEEE Transactions on Knowledge and Data Engineering.

[25]  Enrico Motta,et al.  Unsupervised Learning of Link Discovery Configuration , 2012, ESWC.