Randomness Efficient Feature Hashing for Sparse Binary Data

We present sketching algorithms for sparse binary datasets that keep the sketch itself binary while simultaneously preserving multiple similarity measures (Jaccard Similarity, Cosine Similarity, Inner Product, and Hamming Distance) on the same sketch. A major advantage of our algorithms is that they are randomness efficient: they require significantly fewer random bits for sketching, a number logarithmic in the dimension, whereas other competitive algorithms require a number linear in the dimension. Our proposed algorithms are efficient, offer a compact sketch of the dataset, and can be deployed efficiently in a distributed setting. We present a theoretical analysis of our approach and complement it with extensive experiments on public datasets. For analysis purposes, our algorithms require a natural assumption on the dataset. We empirically verify this assumption and observe that it holds on several real-world datasets.
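
As a rough illustration of the general idea, and not the algorithm analysed in the paper, the following minimal sketch hashes the nonzero coordinates of a sparse binary vector into a small number of buckets and combines collisions with a logical OR, so the sketch stays binary. The names `draw_hash`, `sketch`, `num_buckets`, and the prime `P` are assumptions made for this example. The hash is a standard universal hash of the form ((a·i + b) mod P) mod N, whose seed is just the two integers a and b; this is one possible way to keep the number of random bits small, assumed here for illustration. The similarity functions compute raw values on the sketches only; any estimator that maps them back to the original similarities is outside the scope of this example.

```python
import random

# Assumption for this example: a fixed prime larger than the data dimension.
P = 2_147_483_647  # 2**31 - 1


def draw_hash(num_buckets, rng):
    """Draw a universal hash i -> ((a*i + b) mod P) mod num_buckets.

    The only randomness consumed is the pair (a, b), each below P.
    """
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(i):
        return ((a * i + b) % P) % num_buckets

    return h


def sketch(indices, h):
    """Binary sketch of a sparse binary vector given by its nonzero indices.

    A bucket is set if at least one nonzero coordinate hashes to it (OR),
    so the sketch is again a sparse binary vector, stored as a set of buckets.
    """
    return frozenset(h(i) for i in indices)


def inner_product(s1, s2):
    return len(s1 & s2)


def hamming(s1, s2):
    return len(s1 ^ s2)


def jaccard(s1, s2):
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    h = draw_hash(num_buckets=64, rng=rng)
    u = {1, 5, 9, 1000}   # nonzero coordinates of one binary vector
    v = {5, 9, 20, 1000}  # nonzero coordinates of another
    su, sv = sketch(u, h), sketch(v, h)
    print(inner_product(su, sv), hamming(su, sv), jaccard(su, sv))
```

Because the same hash function is applied to every vector, all four similarity measures can be read off a single pair of sketches, and each machine in a distributed setting only needs the pair (a, b) to produce compatible sketches.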
