Dragon: Decision Tree Learning for Link Discovery

The provision of links across RDF knowledge bases is regarded as fundamental to ensuring that knowledge bases can be used jointly to address the real-world needs of applications. The growth of knowledge bases, with respect to both their number and their size, demands the development of time-efficient and accurate approaches for the computation of such links. This computation is generally carried out with the aid of machine learning approaches such as decision trees. While decision trees are known to be fast, they are generally outperformed in the link discovery task by the state of the art in terms of quality, i.e., F-measure. In this work, we present Dragon, a fast decision-tree-based approach that is both efficient and accurate. We evaluated our approach by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with respect to its F-measure while being 18 times faster on average than existing algorithms for link discovery on RDF knowledge bases. Furthermore, we investigate why Dragon significantly outperforms J48 in terms of link accuracy. We provide an open-source implementation of our algorithm in the LIMES framework.
