Top-k Set Similarity Joins

Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed the top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of the similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
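To make the problem statement concrete, the following is a minimal sketch of what a top-k set similarity join returns. It is a brute-force baseline that scores every pair, not the topk-join algorithm itself, and it uses neither prefix filtering nor upper bounds. Jaccard similarity is assumed as the similarity function, and the names `jaccard` and `naive_topk_join` are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity function)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j) tuples.

    Brute-force baseline: every pair is scored, so the cost is quadratic
    in the number of records. No similarity threshold is required.
    """
    scored = (
        (jaccard(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy data: each record is a set of tokens.
records = [
    frozenset({"a", "b", "c", "d"}),
    frozenset({"a", "b", "c", "e"}),
    frozenset({"a", "d", "x"}),
    frozenset({"p", "q"}),
]
top2 = naive_topk_join(records, 2)
for sim, i, j in top2:
    print(f"records {i} and {j}: similarity {sim:.2f}")
```

The caller supplies only `k`, which is the point of the top-k formulation. An efficient algorithm has to produce the same answer while avoiding the enumeration of most pairs.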
