Synthesizing Entity Matching Rules by Examples

Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive and negative matching examples. The core of our solution is program synthesis, a powerful tool that automatically generates rules (or programs) satisfying a given high-level specification, drawn from a predefined grammar. This grammar describes a General Boolean Formula (GBF), which can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨), and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. Rules in GBF form are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning models that output deep trees with many branches. We present a new synthesis algorithm that, given only positive and negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that our rules outperform other interpretable models (e.g., decision trees with low depth) in effectiveness and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests, and SVMs).
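To make the GBF rule form concrete, the following is a minimal Python sketch of how such a rule might be represented and evaluated on a pair of records. It is an illustration only, not the paper's grammar or synthesis algorithm: the class names (`Pred`, `Missing`, `And`, `Or`, `Not`), the `matches` function, the similarity functions, the thresholds, and the sample records are all hypothetical, and treating "missing" as an explicit predicate is an assumption made here to show how negation can handle absent values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union

Record = Dict[str, Optional[str]]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (illustrative choice of measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact(a: str, b: str) -> float:
    """1.0 when the two strings are identical, else 0.0."""
    return 1.0 if a == b else 0.0


@dataclass(frozen=True)
class Pred:
    """Attribute matching predicate: sim(r[attr], s[attr]) >= threshold."""
    attr: str
    sim: Callable[[str, str], float]
    threshold: float


@dataclass(frozen=True)
class Missing:
    """Hypothetical predicate: true when attr is absent in either record."""
    attr: str


@dataclass(frozen=True)
class And:
    left: "Rule"
    right: "Rule"


@dataclass(frozen=True)
class Or:
    left: "Rule"
    right: "Rule"


@dataclass(frozen=True)
class Not:
    arg: "Rule"


Rule = Union[Pred, Missing, And, Or, Not]


def matches(rule: Rule, r: Record, s: Record) -> bool:
    """Recursively evaluate a Boolean formula on the record pair (r, s)."""
    if isinstance(rule, Pred):
        a, b = r.get(rule.attr), s.get(rule.attr)
        if a is None or b is None:
            return False  # assumption: a predicate on a missing value fails
        return rule.sim(a, b) >= rule.threshold
    if isinstance(rule, Missing):
        return r.get(rule.attr) is None or s.get(rule.attr) is None
    if isinstance(rule, And):
        return matches(rule.left, r, s) and matches(rule.right, r, s)
    if isinstance(rule, Or):
        return matches(rule.left, r, s) or matches(rule.right, r, s)
    if isinstance(rule, Not):
        return not matches(rule.arg, r, s)
    raise TypeError(f"unknown rule node: {rule!r}")


# name ∧ ((¬missing(zip) ∧ zip equal) ∨ (missing(zip) ∧ address similar))
rule = And(
    Pred("name", jaccard, 0.5),
    Or(
        And(Not(Missing("zip")), Pred("zip", exact, 1.0)),
        And(Missing("zip"), Pred("address", jaccard, 0.8)),
    ),
)

r1 = {"name": "John A Smith", "zip": "02139", "address": "32 Vassar St"}
r2 = {"name": "John Smith", "zip": "02139", "address": "32 Vassar Street"}
r3 = {"name": "John Smith", "zip": None, "address": "32 Vassar St"}
r4 = {"name": "John Smith", "zip": "10001", "address": "32 Vassar St"}

same_zip = matches(rule, r1, r2)      # zip present and equal
fallback = matches(rule, r1, r3)      # zip missing, falls back to address
different_zip = matches(rule, r1, r4)  # zip present but different
```

In this sketch the shared `name` predicate appears once, whereas an equivalent DNF rule would repeat it in every conjunct; that factoring is the kind of conciseness the abstract attributes to GBF rules.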
