Paper Information - Hashing Algorithms for Large-Scale Learning

Hashing Algorithms for Large-Scale Learning

Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
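As a rough illustration of the idea in the abstract, the sketch below computes a b-bit minwise hash signature for a set of integer feature indices and turns it into sparse binary features that a linear SVM or logistic regression could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`make_hash_params`, `bbit_minhash`, `expand_features`) are hypothetical, simple hash functions of the form (a*x + c) mod p stand in for the random permutations of minwise hashing, and the expansion of each b-bit value into a one-hot block of length 2^b is assumed here as one way to feed the signature to a linear learner.

```python
import random

# Modulus for the hash family (an assumption of this sketch, not from the source).
MERSENNE_PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs defining hash functions h(x) = (a*x + c) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(k)
    ]


def bbit_minhash(items, params, b):
    """For each hash function, keep only the lowest b bits of the minimum hash value."""
    mask = (1 << b) - 1
    return [
        min((a * x + c) % MERSENNE_PRIME for x in items) & mask
        for a, c in params
    ]


def expand_features(signature, b):
    """Return the indices of the ones in a (2^b * k)-dimensional binary vector.

    Position j of the signature owns the block [j * 2^b, (j + 1) * 2^b),
    and the b-bit value selects exactly one index inside that block.
    """
    block = 1 << b
    return [j * block + v for j, v in enumerate(signature)]


def count_matches(sig_a, sig_b):
    """Count positions where two signatures agree.

    This equals the inner product of the two expanded binary vectors.
    """
    return sum(1 for u, v in zip(sig_a, sig_b) if u == v)
```

For example, calling `bbit_minhash(doc, make_hash_params(k=200), b=2)` on a set `doc` of integer feature indices yields 200 values in the range 0 to 3, so each data point costs 400 bits of storage, and `expand_features` maps it to a vector of dimension 800 with exactly 200 ones.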

[1] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[2] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[3] Patrick Haffner, et al. Support vector machines for histogram-based image classification, 1999, IEEE Trans. Neural Networks.

[4] Marc Najork, et al. A large-scale study of the evolution of Web pages, 2003, WWW '03.

[5] Dimitris Achlioptas, et al. Database-friendly random projections: Johnson-Lindenstrauss with binary coins, 2003, J. Comput. Syst. Sci.

[6] Kenneth Ward Church, et al. Using Sketches to Estimate Associations, 2005, HLT.

[7] Matthias Hein, et al. Hilbertian Metrics and Positive Definite Kernels on Probability Measures, 2005, AISTATS.

[8] Graham Cormode, et al. An improved data stream summary: the count-min sketch and its applications, 2004, J. Algorithms.

[9] Kenneth Ward Church, et al. Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data, 2006, NIPS.

[10] Thorsten Joachims, et al. Training linear SVMs in linear time, 2006, KDD '06.

[11] Kenneth Ward Church, et al. Improving Random Projections Using Marginal Information, 2006, COLT.

[12] Alexandr Andoni, et al. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[13] Kenneth Ward Church, et al. Very sparse random projections, 2006, KDD '06.