Paper information - Efficient exact set-similarity joins

Efficient exact set-similarity joins

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
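To make the operator concrete, the following is a minimal brute-force sketch of what an exact SSJoin returns. It is not one of the algorithms proposed in the paper: it compares every pair of sets and carries no performance guarantee. The choice of Jaccard similarity, the threshold parameter, and the names `jaccard` and `naive_ssjoin` are assumptions made for illustration only.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def naive_ssjoin(left, right, threshold):
    """Return every index pair (i, j) with jaccard(left[i], right[j]) >= threshold.

    Hypothetical baseline: exact, but it examines all len(left) * len(right) pairs.
    """
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(left), enumerate(right))
        if jaccard(r, s) >= threshold
    ]


# Two small input collections of sets (made-up data).
left = [frozenset({"a", "b", "c"}), frozenset({"x", "y"})]
right = [
    frozenset({"a", "b", "c", "d"}),
    frozenset({"y", "z"}),
    frozenset({"x", "y"}),
]

pairs = naive_ssjoin(left, right, 0.7)
print(pairs)  # [(0, 0), (1, 2)]
```

The output is exact in the sense the abstract uses: every qualifying pair is reported and no pair is reported incorrectly. The quadratic number of comparisons is the cost that more efficient SSJoin algorithms aim to avoid.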