Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

We generated a 200 GB dataset with 10^9 features to test our recent b-bit minwise hashing algorithms for training very large-scale logistic regression and SVM. The results confirm our prior work: compared with the VW hashing algorithm (which has the same variance as random projections), b-bit minwise hashing is substantially more accurate at the same storage. For example, with merely 30 hashed values per data point, b-bit minwise hashing can achieve accuracy similar to that of VW with 2^14 hashed values per data point. We demonstrate that the preprocessing cost of b-bit minwise hashing is on roughly the same order of magnitude as the data loading time. Furthermore, by using a GPU, the preprocessing cost can be reduced to a small fraction of the data loading time. Minwise hashing has been widely used in industry, at least in the context of search. One reason for its popularity is that permutations can be simulated efficiently by, e.g., universal hashing, so there is no need to store the permutation matrix. In this paper, we empirically verify this practice by demonstrating that even the simplest 2-universal hashing does not degrade the learning performance.
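
As an illustration of the idea only, the following minimal Python sketch simulates k permutations with 2-universal hash functions of the form h(x) = (a*x + c) mod p and keeps the lowest b bits of each minimum. The function names, the choice of prime p = 2^31 - 1, and the seed handling are assumptions made for this example and are not taken from the paper. The `expand` helper reflects our understanding of how the k b-bit codes are turned into a sparse binary vector of length k * 2^b for a linear learner; the abstract itself does not describe that step.

```python
import random

# Assumption: a Mersenne prime larger than the 10^9 feature space.
P = 2**31 - 1


def make_hash_params(k, seed=0):
    """Draw k pairs (a, c) defining h(x) = (a*x + c) mod P (2-universal family)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def bbit_minhash(feature_ids, params, b):
    """Return one b-bit code per hash function for a non-empty set of feature ids.

    Each hash function stands in for a random permutation of the feature space;
    only the lowest b bits of the minimum hashed value are stored.
    """
    mask = (1 << b) - 1
    return [min((a * x + c) % P for x in feature_ids) & mask for a, c in params]


def expand(codes, b):
    """Map k b-bit codes to the k nonzero indices of a (k * 2^b)-dim binary vector."""
    width = 1 << b
    return [j * width + v for j, v in enumerate(codes)]
```

With k = 30 and b = 8, for instance, each data point is reduced to 30 bytes, and the expanded vector has exactly 30 nonzero entries out of 30 * 256 positions. The same `params` must be used for every data point so that the codes are comparable.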
