Improving Grouped-Entity Resolution Using Quasi-Cliques

The entity resolution (ER) problem, which is to identify duplicate entities that refer to the same real-world entity, is essential in many applications. In this paper, we focus on resolving entities that contain a group of related elements (e.g., an author entity with a list of citations, a singer entity with a song list, or the intermediate result of a SQL GROUP BY query). Such entities, which we call grouped-entities, occur frequently in many applications. Previous approaches to grouped-entity resolution often rely on textual similarity and produce a large number of false positives. As a complementary technique, we present our experience of applying a recently proposed graph mining technique, Quasi-Clique, atop conventional ER solutions. Our approach exploits contextual information mined from the group of elements in each entity, in addition to syntactic similarity. Extensive experiments verify that our proposal improves precision and recall by up to 83% when used together with a variety of existing ER solutions, and never worsens them.
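To make the notion of a quasi-clique concrete, here is a minimal sketch. It assumes the common degree-based definition: a vertex set S is a γ-quasi-clique if every vertex in S is adjacent to at least γ(|S| - 1) other vertices in S. The function name `is_quasi_clique`, the toy graph, and the threshold values are hypothetical and chosen for illustration only; this is not the paper's mining algorithm, which is not described in the abstract.

```python
def is_quasi_clique(vertices, edges, gamma):
    """Return True if every vertex in `vertices` is adjacent to at least
    gamma * (len(vertices) - 1) other vertices of the same set.

    Assumption: degree-based gamma-quasi-clique definition, undirected graph.
    """
    vertices = set(vertices)
    # Build an undirected adjacency map restricted to the candidate set.
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in vertices and v in vertices and u != v:
            adj[u].add(v)
            adj[v].add(u)
    needed = gamma * (len(vertices) - 1)
    return all(len(adj[v]) >= needed for v in vertices)


# Hypothetical graph built from the elements of one grouped entity,
# e.g. co-authors that appear together in an author's citation list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]

print(is_quasi_clique({"a", "b", "c", "d"}, edges, 0.6))
print(is_quasi_clique({"a", "b", "c", "d"}, edges, 1.0))
```

With γ = 1 the check reduces to an ordinary clique test; lower values of γ tolerate some missing edges, which is what makes quasi-cliques useful as a looser signal of shared context between two grouped entities.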
