Paper Information - b-Bit Minwise Hashing for Large-Scale Learning

b-Bit Minwise Hashing for Large-Scale Learning

Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
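
To make the idea in the abstract concrete, the following is a minimal sketch of how a binary feature set might be turned into input for a linear learner: compute k minwise hashes, keep only the lowest b bits of each, and expand each b-bit value into a one-hot block of length 2^b. Several parts are assumptions for illustration and are not stated in the abstract: the function names, the use of simple linear hash functions modulo a prime in place of true random permutations, and the one-hot expansion as the way to combine the hashed values with a linear model.

```python
import random

# Assumption: a large Mersenne prime used as the modulus of the hash family.
PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs defining hash functions h(x) = (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(feature_set, params):
    """For each hash function, keep the minimum hashed value over the set.

    feature_set must be a non-empty collection of non-negative integer
    feature indices (the positions of the 1s in a binary data vector).
    """
    return [min((a * x + c) % PRIME for x in feature_set) for a, c in params]


def b_bit_signature(signature, b):
    """Keep only the lowest b bits of each minwise hash value."""
    mask = (1 << b) - 1
    return [h & mask for h in signature]


def expand_one_hot(bbit_sig, b):
    """Expand each b-bit value into a one-hot block of length 2**b.

    The result is a binary vector of length k * 2**b with exactly k ones,
    which can be passed to a linear SVM or logistic regression solver.
    """
    width = 1 << b
    vec = [0] * (width * len(bbit_sig))
    for j, v in enumerate(bbit_sig):
        vec[j * width + v] = 1
    return vec


# Hypothetical example: a document represented by four active feature indices.
doc = {3, 17, 42, 1001}
params = make_hash_params(k=4, seed=1)
sig = b_bit_signature(minhash_signature(doc, params), b=2)
features = expand_one_hot(sig, b=2)
```

With k = 4 and b = 2, `features` has length 16 and contains exactly four ones, so the storage per example is k * b = 8 bits regardless of the original dimensionality. Larger k and b trade storage for accuracy.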
