V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors

This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state of the art algorithm, VCL, when compared on a real dataset of a small size. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL never succeeded to finish. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.

[1]  Yun Chi,et al.  Blog Community Discovery and Evolution Based on Mutual Awareness Expansion , 2007, IEEE/WIC/ACM International Conference on Web Intelligence (WI'07).

[2]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[3]  John Yen,et al.  Probabilistic Community Discovery Using Hierarchical Latent Gaussian Mixture Model , 2007, AAAI.

[4]  Alison L Gibbs,et al.  On Choosing and Bounding Probability Metrics , 2002, math/0209021.

[5]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[6]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[7]  Trevor Darrell,et al.  Approximate Correspondences in High Dimensions , 2006, NIPS.

[8]  Piotr Indyk,et al.  Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality , 2012, Theory Comput..

[9]  Sung-Hyuk Cha,et al.  On measuring the distance between histograms , 2002, Pattern Recognit..

[10]  Jiaheng Lu,et al.  Efficient Merging and Filtering Algorithms for Approximate String Searches , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[11]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[12]  GhemawatSanjay,et al.  The Google file system , 2003 .

[13]  Divyakant Agrawal,et al.  SLEUTH: Single-pubLisher attack dEtection Using correlaTion Hunting , 2008, Proc. VLDB Endow..

[14]  Ahmed Metwally,et al.  Estimating the number of users behind ip addresses for combating abusive traffic , 2011, KDD.

[15]  Rob Pike,et al.  Interpreting the data: Parallel analysis with Sawzall , 2005, Sci. Program..

[16]  Divyakant Agrawal,et al.  Detectives: detecting coalition hit inflation attacks in advertising networks streams , 2007, WWW '07.

[17]  Sunita Sarawagi,et al.  Efficient set joins on similarity predicates , 2004, SIGMOD '04.

[18]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[19]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[20]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[21]  Srinivasan Parthasarathy,et al.  Scalable graph clustering using stochastic flows: applications to community discovery , 2009, KDD.

[22]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[23]  Jimmy J. Lin,et al.  Pairwise Document Similarity in Large Collections with MapReduce , 2008, ACL.

[24]  Ranieri Baraglia,et al.  Document Similarity Self-Join with MapReduce , 2010, 2010 IEEE International Conference on Data Mining.

[25]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[26]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[27]  R. Stanley Enumerative Combinatorics: Volume 1 , 2011 .

[28]  Weixiong Zhang,et al.  An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[29]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[30]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[31]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[32]  Sung-Hyuk Cha Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions , 2007 .

[33]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[34]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.