Bufoosh: Buffering Algorithms for Generic Entity Resolution

Entity Resolution (ER) is a problem that arises in many information integration applications. ER process identifies duplicated records that refer to the same real-world entity (match process), and derives composite information about the entity (merge process). Even though the cost of the match process is high, the cost of disk I/O is likely to be the dominant problem for a limited memory or very large record set. In this paper, we proposed buffering algorithms for ER, Bufoosh, based on lazy disk update and locality-aware match scheduling. Our evaluation results using Yahoo! shopping data show that the algorithms can reduce disk I/O dramatically.

[1]  Craig A. Knoblock,et al.  Exploiting Secondary Sources for Unsupervised Record Linkage , 2004 .

[2]  Chen Li,et al.  Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[3]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[4]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[5]  H B NEWCOMBE,et al.  Automatic linkage of vital records. , 1959, Science.

[6]  Jennifer Widom,et al.  Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.

[7]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[8]  George V. Moustakides,et al.  A Bayesian decision model for cost optimal record matching , 2003, The VLDB Journal.

[9]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[10]  William W. Cohen Data integration using similarity joins and a word-based information representation language , 2000, TOIS.

[11]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[12]  Lifang Gu,et al.  Record Linkage: Current Practice and Future Directions , 2003 .

[13]  Charles Elkan,et al.  An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records , 1997, DMKD.

[14]  W. Winkler USING THE EM ALGORITHM FOR WEIGHT COMPUTATION IN THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE , 2000 .

[15]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[16]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[17]  Pedro M. Domingos,et al.  Object Identification with Attribute-Mediated Dependences , 2005, PKDD.

[18]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[19]  Paul Hsiung,et al.  Alias Detection in Link Data Sets , 2004 .

[20]  Craig A. Knoblock,et al.  Learning object identification rules for information integration , 2001, Inf. Syst..

[21]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[22]  W. Winkler Overview of Record Linkage and Current Research Directions , 2006 .

[23]  Peter Christen,et al.  A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[24]  Rajeev Motwani,et al.  Robust identification of fuzzy duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).

[25]  Sugato Basu,et al.  Adaptive product normalization: using online learning for record linkage in comparison shopping , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).