Hashing Algorithms for Large-Scale Learning

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.

[1]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[2]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[3]  Patrick Haffner,et al.  Support vector machines for histogram-based image classification , 1999, IEEE Trans. Neural Networks.

[4]  Marc Najork,et al.  A large‐scale study of the evolution of Web pages , 2003, WWW '03.

[5]  Dimitris Achlioptas,et al.  Database-friendly random projections: Johnson-Lindenstrauss with binary coins , 2003, J. Comput. Syst. Sci..

[6]  Kenneth Ward Church,et al.  Using Sketches to Estimate Associations , 2005, HLT.

[7]  Matthias Hein,et al.  Hilbertian Metrics and Positive Definite Kernels on Probability Measures , 2005, AISTATS.

[8]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[9]  Kenneth Ward Church,et al.  Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data , 2006, NIPS.

[10]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[11]  Kenneth Ward Church,et al.  Improving Random Projections Using Marginal Information , 2006, COLT.

[12]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[13]  Kenneth Ward Church,et al.  Very sparse random projections , 2006, KDD '06.

[14]  Gurmeet Singh Manku,et al.  Detecting near-duplicates for web crawling , 2007, WWW '07.

[15]  Konstantinos Kalpakis,et al.  Collaborative Data Gathering in Wireless Sensor Networks Using Measurement Co-Occurrence , 2007, 2007 International Conference on Sensor Technologies and Applications (SENSORCOMM 2007).

[16]  Chong-Wah Ngo,et al.  Towards optimal bag-of-features for object categorization and semantic video retrieval , 2007, CIVR '07.

[17]  Chih-Jen Lin,et al.  A dual coordinate descent method for large-scale linear SVM , 2008, ICML '08.

[18]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[19]  Thomas Lavergne,et al.  Tracking Web spam with HTML style similarities , 2008, TWEB.

[20]  Bing Liu,et al.  Opinion spam and analysis , 2008, WSDM '08.

[21]  Gregory Buehrer,et al.  A scalable pattern mining approach to web graph compression with communities , 2008, WSDM '08.

[22]  Kilian Q. Weinberger,et al.  Feature hashing for large scale multitask learning , 2009, ICML '09.

[23]  Sreenivas Gollapudi,et al.  An axiomatic approach for result diversification , 2009, WWW '09.

[24]  W. Bruce Croft,et al.  Finding text reuse on the web , 2009, WSDM '09.

[25]  Sreenivas Gollapudi,et al.  Less is more: sampling the neighborhood graph makes SALSA better and faster , 2009, WSDM '09.

[26]  John Langford,et al.  Hash Kernels for Structured Data , 2009, J. Mach. Learn. Res..

[27]  Marco Pellegrini,et al.  Extraction and classification of dense implicit communities in the Web graph , 2009, TWEB.

[28]  Ludmila Cherkasova,et al.  Applying syntactic similarity algorithms for enterprise information management , 2009, KDD.

[29]  George Forman,et al.  Efficient detection of large-scale redundancy in enterprise file systems , 2009, OPSR.

[30]  Silvio Lattanzi,et al.  On compressing social networks , 2009, KDD.

[31]  Cho-Jui Hsieh,et al.  Large linear classification when data cannot fit in memory , 2010, KDD.

[32]  Ping Li,et al.  b-Bit minwise hashing , 2009, WWW '10.

[33]  Ping Li,et al.  b-Bit Minwise Hashing for Estimating Three-Way Similarities , 2010, NIPS.

[34]  Ping Li,et al.  Accurate Estimators for Improving Minwise Hashing and b-Bit Minwise Hashing , 2011, ArXiv.

[35]  Ping Li,et al.  Theory and applications of b-bit minwise hashing , 2011, Commun. ACM.

[36]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[37]  Chih-Jen Lin,et al.  Large Linear Classification When Data Cannot Fit in Memory , 2011, TKDD.