Efficient entity resolution based on subgraph cohesion

Entity resolution has wide applications and has received considerable attention in the literature. In entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similarity relations determined by many commonly used similarity functions are not transitive. This can produce a conflict in which $A$ and $B$ refer to the same entity, and $B$ and $C$ refer to the same entity, but $A$ and $C$ do not. To address this problem and make group-wise entity resolution results consistent with pairwise entity resolution, this paper models entity resolution as the problem of partitioning the vertices of a weighted graph into cohesive subgraphs, a problem that is proven to be co-NP-complete. To solve it, an approximation algorithm with a bounded approximation ratio is proposed. To perform entity resolution efficiently on large data sets, a heuristic algorithm is also developed. To implement the heuristic algorithm efficiently, a similarity measure compatible with many commonly used measures is presented. Based on this similarity measure, indices and efficient implementations of the heuristic algorithm are proposed. Extensive experiments verify the efficiency and effectiveness of the proposed methods.
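The transitivity conflict described above can be illustrated with a minimal sketch. It assumes a token-level Jaccard similarity and a match threshold of 0.5; both are illustrative choices rather than the measure used in the paper, and the function names (`jaccard`, `same_entity`, `transitivity_violations`) and the sample records are hypothetical.

```python
from itertools import combinations


def jaccard(x: str, y: str) -> float:
    """Token-level Jaccard similarity (an assumed, illustrative measure)."""
    a, b = set(x.lower().split()), set(y.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_entity(x: str, y: str, threshold: float = 0.5) -> bool:
    """Pairwise decision: treat two records as a match if similarity >= threshold."""
    return jaccard(x, y) >= threshold


def transitivity_violations(records, threshold=0.5):
    """Return triples (a, b, c) where a~b and b~c hold but a~c does not."""
    violations = []
    for b in records:
        others = [r for r in records if r != b]
        for a, c in combinations(others, 2):
            if (same_entity(a, b, threshold)
                    and same_entity(b, c, threshold)
                    and not same_entity(a, c, threshold)):
                violations.append((a, b, c))
    return violations


# Hypothetical records, chosen so that the middle one matches both others.
records = ["John A Smith", "John Smith Jr", "J Smith Jr"]
violations = transitivity_violations(records)
for a, b, c in violations:
    print(f"{a!r} ~ {b!r} and {b!r} ~ {c!r}, but {a!r} !~ {c!r}")
```

Under these assumptions, the first and second records match, as do the second and third, while the first and third do not. Merging matches by transitive closure would therefore place all three records in one group, contradicting the pairwise decision for the outer pair. This is the inconsistency that partitioning into cohesive subgraphs is meant to avoid.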
