Generic entity resolution with negative rules

Entity resolution (ER), also known as deduplication or merge-purge, is the process of identifying records that refer to the same real-world entity and merging them together. In practice, ER results may contain “inconsistencies,” due either to mistakes by the writers of the match and merge functions or to changes in the application semantics. To remove the inconsistencies, we introduce “negative rules” that disallow inconsistencies in the ER solution (ER with negative rules, or ER-N). A consistent solution is then derived based on guidance from a domain expert. The inconsistencies can be resolved in several ways, leading to accurate solutions. We formalize ER-N, treating the match, merge, and negative rules as black boxes, which permits expressive and extensible ER-N solutions. We identify important properties for the rules that, if satisfied, enable less costly ER-N. We develop and evaluate two algorithms that find an ER-N solution based on guidance from the domain expert: the GNR algorithm, which does not assume the properties, and the ENR algorithm, which exploits them.
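
To make the black-box setting concrete, the following is a minimal sketch of how an inconsistency can arise; it is not the GNR or ENR algorithm. The record representation, the particular `match`, `merge`, and `negative` functions, and the naive `resolve` loop are all assumptions chosen for illustration; the abstract only states that such rules are treated as black boxes.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

# Assumption: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def match(r: Record, s: Record) -> bool:
    """Hypothetical match rule: records match if they share a name or a phone."""
    keys = {"name", "phone"}
    return any(attr in keys and (attr, value) in s for attr, value in r)


def merge(r: Record, s: Record) -> Record:
    """Hypothetical merge rule: take the union of all attribute values."""
    return r | s


def negative(r: Record) -> bool:
    """Hypothetical negative rule: a record must not carry two genders."""
    genders = {value for attr, value in r if attr == "gender"}
    return len(genders) > 1


def resolve(records: List[Record]) -> List[Record]:
    """Naive ER: merge matching pairs until no pair matches."""
    result = list(records)
    changed = True
    while changed:
        changed = False
        for r, s in combinations(result, 2):
            if match(r, s):
                result.remove(r)
                result.remove(s)
                merged = merge(r, s)
                if merged not in result:
                    result.append(merged)
                changed = True
                break  # restart the scan, since the list was modified
    return result


def inconsistencies(records: List[Record]) -> List[Record]:
    """Return the records that violate the negative rule."""
    return [r for r in records if negative(r)]


r1 = frozenset({("name", "J. Doe"), ("gender", "M")})
r2 = frozenset({("name", "J. Doe"), ("phone", "555")})
r3 = frozenset({("phone", "555"), ("gender", "F")})
r4 = frozenset({("name", "A. Roe"), ("gender", "F")})

solution = resolve([r1, r2, r3, r4])
flagged = inconsistencies(solution)
```

Here `r1` and `r2` merge on the shared name, and the result then merges with `r3` on the shared phone, so the merged record ends up with two genders and is flagged. Deciding which merge to undo is the kind of choice the abstract leaves to the domain expert.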
