b-Bit Minwise Hashing for Large-Scale Learning

Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms, such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
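As a rough illustration of the idea summarized above, the following Python sketch computes k minwise hash values for a set, keeps only the lowest b bits of each, and expands the result into sparse feature indices that a linear learner could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`make_hash_params`, `bbit_minhash`, `expand`) are hypothetical, the hash family `(a*x + c) mod p` is used here as a cheap stand-in for true random permutations, and the one-hot expansion into a vector of length 2^b * k is one way of feeding the b-bit values to a linear model.

```python
import random

# Modulus for the illustrative hash family (an assumption of this sketch).
MERSENNE_PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k pairs (a, c) defining hash functions h(x) = (a*x + c) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(k)
    ]


def bbit_minhash(items, params, b):
    """Return the lowest b bits of each of the k minimum hash values.

    `items` is a non-empty collection of non-negative integers,
    for example the indices of the nonzero features of a binary vector.
    """
    mask = (1 << b) - 1
    signature = []
    for a, c in params:
        smallest = min((a * x + c) % MERSENNE_PRIME for x in items)
        signature.append(smallest & mask)  # keep only the lowest b bits
    return signature


def expand(signature, b):
    """Map each b-bit value to one active index in its own block of size 2^b.

    The result lists the positions of the ones in a binary vector of
    length 2^b * k that has exactly k nonzero entries.
    """
    width = 1 << b
    return [j * width + value for j, value in enumerate(signature)]


if __name__ == "__main__":
    params = make_hash_params(k=8, seed=1)
    doc = {3, 17, 42, 1001}
    sig = bbit_minhash(doc, params, b=2)
    print(sig)
    print(expand(sig, b=2))
```

With this layout each example costs b * k bits of storage, and the expanded indices can be written in the sparse input format of a linear SVM or logistic regression solver.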
