Power-Law Based Estimation of Set Similarity Join Size

We propose a novel technique for estimating the size of a set similarity join (SSJoin). The technique relies on a succinct representation of sets as Min-Hash signatures. We estimate the SSJoin size by counting the support of frequent patterns in the signatures. Because the counts of different signature patterns overlap, we apply the Inclusion-Exclusion (IE) principle to correct for the overlap. We develop a novel lattice-based counting method that evaluates the IE principle efficiently, in time linear in the lattice size. To keep the mining process lightweight, we exploit a recently discovered power-law relationship between pattern count and frequency. Extensive experimental evaluations show that the proposed technique produces accurate estimates efficiently.
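
As background for the signature representation mentioned above, the following is a minimal sketch of standard Min-Hash signatures and the similarity estimate they support. It is not the estimator proposed in the paper: the function names (`make_hash_funcs`, `minhash_signature`, `estimated_jaccard`), the linear hash family, and the signature length are all assumptions chosen for illustration, and set elements are assumed to be non-negative integers smaller than the prime modulus.

```python
import random

PRIME = 2_147_483_647  # assumed modulus (2^31 - 1); elements must be below it


def make_hash_funcs(k, seed=0):
    """Build k hash functions h(x) = (a*x + b) mod PRIME (illustrative family)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]
    # Bind a and b as defaults so each lambda keeps its own parameters.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_signature(items, hash_funcs):
    """Signature = the minimum hash value of the set under each hash function."""
    return tuple(min(h(x) for x in items) for h in hash_funcs)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    funcs = make_hash_funcs(k=128)
    r = set(range(0, 100))
    s = set(range(50, 150))
    sig_r = minhash_signature(r, funcs)
    sig_s = minhash_signature(s, funcs)
    print("exact:", exact_jaccard(r, s))
    print("estimated:", estimated_jaccard(sig_r, sig_s))
```

Two sets agree on a signature position with probability roughly equal to their Jaccard similarity, so pairs of similar sets tend to share many signature values. This is the property that makes it plausible to look for frequently shared patterns across signatures instead of comparing the original sets directly.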
