Learning linkage rules using genetic programming

An important problem in Linked Data is the discovery of links between entities that identify the same real-world object. These links are often generated from manually written linkage rules, which specify the conditions two entities must fulfill in order to be interlinked. In this paper, we present an approach that automatically generates linkage rules from a set of reference links. Our approach is based on genetic programming and has been implemented in the Silk Link Discovery Framework. It is capable of generating complex linkage rules that compare multiple properties of the entities and employ data transformations to normalize their values. Experimental results show that it outperforms a genetic programming approach for record deduplication recently presented by Carvalho et al. In tests with linkage rules that were created for our research projects, our approach learned rules that achieve accuracy similar to that of the original human-created linkage rules.
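To make the idea concrete, the following is a minimal sketch of how a linkage rule might be represented as a tree and scored against reference links. It is not the Silk API and not the representation used in the paper: the class names `Comparison` and `Aggregation`, the string similarity from Python's `difflib`, the 0.8 threshold, and the use of the F-measure as fitness are all assumptions made for illustration. A genetic programming search would generate many such trees and keep the ones with the highest fitness.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Entity = Dict[str, str]
Link = Tuple[Entity, Entity]


@dataclass
class Comparison:
    """Leaf node (hypothetical): compares one property of each entity
    after applying a normalizing transformation to both values."""
    prop_a: str
    prop_b: str
    transform: Callable[[str], str]

    def score(self, a: Entity, b: Entity) -> float:
        va = self.transform(a.get(self.prop_a, ""))
        vb = self.transform(b.get(self.prop_b, ""))
        return SequenceMatcher(None, va, vb).ratio()


@dataclass
class Aggregation:
    """Inner node (hypothetical): combines child scores with min,
    so every child comparison has to agree."""
    children: List[object]

    def score(self, a: Entity, b: Entity) -> float:
        return min(child.score(a, b) for child in self.children)


def f_measure(rule, positives: List[Link], negatives: List[Link],
              threshold: float = 0.8) -> float:
    """Fitness (assumed): F-measure of the rule on the reference links."""
    tp = sum(rule.score(a, b) >= threshold for a, b in positives)
    fp = sum(rule.score(a, b) >= threshold for a, b in negatives)
    fn = len(positives) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy reference links, invented for this example.
positives = [({"label": "Berlin ", "country": "DE"},
              {"name": "berlin", "country": "de"})]
negatives = [({"label": "Paris", "country": "FR"},
              {"name": "Parma", "country": "IT"})]

# Rule with transformations: trim and lowercase before comparing.
normalized_rule = Aggregation([
    Comparison("label", "name", lambda s: s.strip().lower()),
    Comparison("country", "country", str.lower),
])

# Same comparison on the label, but without any transformation.
raw_rule = Comparison("label", "name", lambda s: s)

fitness_normalized = f_measure(normalized_rule, positives, negatives)
fitness_raw = f_measure(raw_rule, positives, negatives)
```

On this toy data the rule with transformations links the positive pair and rejects the negative pair, while the untransformed comparison misses the positive pair because of the case difference and the trailing space. This mirrors why the abstract stresses rules that normalize values before comparing them.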
