Evaluating Entity Resolution Results (Extended version)

Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise $F_1$, cluster $F_1$) have been used to evaluate ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, we make two contributions. First, we conduct an extensive survey of existing ER measures, showing that they often conflict with each other by ranking the results of ER algorithms differently. Second, we propose a new distance measure for ER (called ``merge distance'') inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of merge distance is that the cost functions for splits and merges can be configured to adjust two important parameters: sensitivity to error type and sensitivity to cluster size. This configurability lets us clearly understand the characteristics of any merge distance measure so defined. Surprisingly, the widely used pairwise $F_1$ measure and a state-of-the-art clustering measure called Variation of Information are both special cases of our merge distance measure. We present an efficient linear-time algorithm that correctly computes the merge distance measure for a large class of cost functions that satisfy reasonable properties.
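
To make the pairwise $F_1$ measure mentioned above concrete, the following is a minimal sketch of how it is commonly computed for an ER result against a gold standard. It is not taken from the paper: the function names `matching_pairs` and `pairwise_f1`, the representation of a clustering as a list of sets, and the toy records are all assumptions made for illustration. It does not implement merge distance.

```python
from itertools import combinations


def matching_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(result, gold):
    """Pairwise F1 of an ER result against a gold-standard clustering.

    Both arguments are lists of sets of record identifiers (an assumed
    representation). Returns 0.0 when no pair is shared, which is a
    convention chosen here for the degenerate cases.
    """
    result_pairs = matching_pairs(result)
    gold_pairs = matching_pairs(gold)
    true_positives = len(result_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(result_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four records, one gold entity of size three.
gold = [{"a", "b", "c"}, {"d"}]
result = [{"a", "b"}, {"c", "d"}]
score = pairwise_f1(result, gold)
```

In this toy case the gold standard contains three matching pairs and the result contains two, of which one is correct, so precision is 1/2, recall is 1/3, and the score is 0.4. Because the number of pairs grows quadratically with cluster size, this measure weights errors in large clusters heavily, which is one instance of the sensitivity to cluster size that the abstract describes.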
