An unsupervised instance matcher for schema-free RDF data

This article presents an unsupervised system that performs instance matching between entities in schema-free Resource Description Framework (RDF) files. Rather than relying on domain expertise or manually labeled samples, the system automatically generates its own heuristic training set. It first uses this training set to align the properties in the input graphs. It then uses the property alignment and the training set together to learn two functions simultaneously: one for the blocking step of instance matching and one for the classification step. Finally, it applies the learned functions to perform instance matching. The full system is implemented as a sequence of components that can be executed iteratively to boost performance. Evaluations on a suite of ten test cases show that the individual components are competitive with state-of-the-art baselines, and that the system as a whole competes effectively with adaptive supervised approaches.
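The order of the stages described above can be illustrated with a toy sketch. It is not the article's implementation: the token-overlap heuristic, the fixed thresholds, the voting-based alignment, and every function name below are assumptions chosen for illustration, and the blocking and classification steps use fixed rules where the article learns functions.

```python
from collections import Counter
from itertools import product


def tokens(text):
    """Lower-cased whitespace tokens of a string."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bag(entity):
    """All tokens of an entity, ignoring property names (schema-free view)."""
    return set().union(*(tokens(v) for v in entity.values()))


def heuristic_training_set(graph_a, graph_b, threshold=0.5):
    """Assumed heuristic: strongly overlapping token bags are noisy positives."""
    return [(ia, ib)
            for (ia, ea), (ib, eb) in product(graph_a.items(), graph_b.items())
            if jaccard(bag(ea), bag(eb)) >= threshold]


def align_properties(graph_a, graph_b, positives):
    """Map each property of graph A to the graph B property it agrees with most."""
    votes = Counter()
    for ia, ib in positives:
        for pa, va in graph_a[ia].items():
            for pb, vb in graph_b[ib].items():
                votes[(pa, pb)] += jaccard(tokens(va), tokens(vb))
    alignment = {}
    for (pa, pb), score in votes.most_common():
        if score > 0 and pa not in alignment:
            alignment[pa] = pb
    return alignment


def block(graph_a, graph_b, alignment):
    """Fixed blocking rule: keep pairs sharing a token on an aligned property."""
    return [(ia, ib)
            for (ia, ea), (ib, eb) in product(graph_a.items(), graph_b.items())
            if any(tokens(ea.get(pa, "")) & tokens(eb.get(pb, ""))
                   for pa, pb in alignment.items())]


def classify(graph_a, graph_b, alignment, candidates, threshold=0.5):
    """Fixed classification rule: mean similarity over aligned properties."""
    matches = []
    for ia, ib in candidates:
        sims = [jaccard(tokens(graph_a[ia].get(pa, "")),
                        tokens(graph_b[ib].get(pb, "")))
                for pa, pb in alignment.items()]
        if sims and sum(sims) / len(sims) >= threshold:
            matches.append((ia, ib))
    return matches


# Hypothetical toy graphs: entity id -> {property name: literal value}.
graph_a = {
    "a1": {"name": "Alan Turing", "born": "1912 London"},
    "a2": {"name": "Grace Hopper", "born": "1906 New York"},
}
graph_b = {
    "b1": {"label": "Alan M. Turing", "birth": "1912 London"},
    "b2": {"label": "Grace Hopper", "birth": "1906 New York"},
}

positives = heuristic_training_set(graph_a, graph_b)
alignment = align_properties(graph_a, graph_b, positives)
candidates = block(graph_a, graph_b, alignment)
matches = classify(graph_a, graph_b, alignment, candidates)
print(alignment)
print(matches)
```

The sketch only shows how each stage consumes the output of the previous one: the heuristic pairs drive the property alignment, and the alignment in turn restricts which property values the blocking and classification steps compare.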
