EMBench++: Data for a thorough benchmarking of matching-related methods

Matching-related methods, i.e., entity resolution, entity search, and the detection of entity evolution, are essential components of a variety of applications. This research area contains a plethora of methods that focus on efficiently and effectively detecting whether two different pieces of information describe the same real-world object or, in the case of entity search and evolution, on retrieving the entities of a given collection that best match the user's description. A primary limitation of this area is the lack of a widely accepted benchmark for extensive experimental evaluation of the proposed methods, covering not only the accuracy of results but also scalability and performance under different data characteristics. This paper introduces EMBench++, a principled system for generating benchmark data for the extensive evaluation of matching-related methods. Our tool is a continuation of a previous system, and its primary contributions are: modifiers that consider not only individual entity types but all available types according to the overall schema; techniques that support the evolution of entities; and mechanisms for controlling the generation of collections of data sets rather than single data sets. We also illustrate collections of entity sets generated by EMBench++ and discuss the benefits of using our system through the results of an experimental evaluation.
