Dimension independent similarity computation

We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension: apart from the initial cost of reading in the data, all subsequent operations are independent of the dimension, so the dimension can be very large. We study the Cosine, Dice, Overlap, and Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems with large-scale experiments using data from the social networking site Twitter. At the time of writing, our algorithms are live in production at twitter.com.
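For orientation, the four similarity measures named above can be sketched for the binary case, where each sparse vector is represented by the set of its nonzero indices. This is a minimal illustration using the standard textbook definitions, not the DISCO sampling algorithms or the improved MinHash from the paper; the function name `similarities` and the example sets are hypothetical.

```python
from math import sqrt


def similarities(a, b):
    """Exact similarity scores between two sets of nonzero indices.

    Assumes binary vectors; these are the standard set-based definitions,
    not the sampling-based estimators the paper describes.
    """
    if not a or not b:
        # Convention chosen for this sketch: an empty vector is similar to nothing.
        return {"cosine": 0.0, "dice": 0.0, "overlap": 0.0, "jaccard": 0.0}
    inter = len(a & b)
    return {
        "cosine": inter / sqrt(len(a) * len(b)),
        "dice": 2 * inter / (len(a) + len(b)),
        "overlap": inter / min(len(a), len(b)),
        "jaccard": inter / len(a | b),
    }


# The index 1_000_000_007 shows that the cost depends on the number of
# nonzero entries, not on the nominal dimension of the vectors.
x = {1, 5, 9, 1_000_000_007}
y = {5, 9, 42}
scores = similarities(x, y)
```

Here the two sets share two indices, so all four scores are computed from an intersection of size 2 and the set sizes 4 and 3; the size of the underlying index space never enters the calculation.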
