0-Bit Consistent Weighted Sampling

We develop 0-bit consistent weighted sampling (CWS) for efficiently estimating the min-max kernel, a generalization of the resemblance kernel originally designed for binary data. Because the estimator of 0-bit CWS constitutes a positive definite kernel, the method can be naturally applied to large-scale data mining problems: if we feed the sampled data from 0-bit CWS to a highly efficient linear classifier (e.g., linear SVM), we effectively (and approximately) train a nonlinear classifier based on the min-max kernel, and the accuracy improves as the sample size increases. In this paper, we first demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data, which helps justify its use. However, because the min-max kernel is nonlinear and may be difficult to deploy in industrial applications with massive data, we propose to linearize it via 0-bit CWS, a simplification of the original CWS method. The previous remarkable work on consistent weighted sampling produces samples of the form (i*, t*), where i* records the location (and in fact also the weight) information, analogous to the samples produced by classical minwise hashing on binary data. Because t* is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. We provide a simple solution: discard t* (which we refer to as the "0-bit" scheme). Through an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply 0-bit CWS to build linear classifiers that approximate min-max kernel classifiers, as validated on a wide range of public datasets. We expect this work will generate interest among data mining practitioners who would like to efficiently exploit the nonlinear information in non-binary, nonnegative data.
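As a concrete illustration (not the paper's implementation), below is a minimal Python/NumPy sketch of the scheme, assuming the CWS construction of Ioffe (ICDM 2010): for each nonzero coordinate i with weight w_i, draw r_i, c_i ~ Gamma(2, 1) and beta_i ~ Uniform(0, 1), set t_i = floor(log(w_i)/r_i + beta_i) and a_i = c_i / exp(r_i (t_i - beta_i + 1)), and return i* = argmin_i a_i together with t* = t_{i*}. The 0-bit scheme keeps only i*, and the collision frequency of i* across two vectors estimates the min-max kernel K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i). The function names here (zero_bit_cws, min_max_kernel) are illustrative.

    import numpy as np

    def min_max_kernel(x, y):
        """Exact min-max kernel: sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
        return np.minimum(x, y).sum() / np.maximum(x, y).sum()

    def zero_bit_cws(X, num_hashes, seed=0):
        """Illustrative 0-bit CWS. X is an (n, D) nonnegative matrix whose
        rows each have at least one positive entry. Returns an
        (n, num_hashes) array of i* values; t* is computed and discarded.
        The randomness (r, c, beta) is fixed per (hash, coordinate) so that
        samples are consistent across data vectors."""
        n, D = X.shape
        rng = np.random.default_rng(seed)
        r = rng.gamma(2.0, 1.0, size=(num_hashes, D))
        c = rng.gamma(2.0, 1.0, size=(num_hashes, D))
        beta = rng.uniform(0.0, 1.0, size=(num_hashes, D))
        out = np.empty((n, num_hashes), dtype=np.int64)
        for m in range(n):
            nz = np.flatnonzero(X[m] > 0)        # active coordinates
            logw = np.log(X[m, nz])
            for j in range(num_hashes):
                t = np.floor(logw / r[j, nz] + beta[j, nz])
                a = c[j, nz] / np.exp(r[j, nz] * (t - beta[j, nz] + 1.0))
                out[m, j] = nz[np.argmin(a)]     # keep i*, drop t* ("0-bit")
        return out

    # Sanity check: collision frequency of i* should track the exact kernel.
    rng = np.random.default_rng(1)
    X = rng.random((2, 100)) * (rng.random((2, 100)) < 0.3)  # sparse nonnegative data
    H = zero_bit_cws(X, num_hashes=2048, seed=7)
    print("exact min-max kernel: %.3f" % min_max_kernel(X[0], X[1]))
    print("0-bit CWS estimate:   %.3f" % np.mean(H[0] == H[1]))

To train a linear classifier as described above, each of the num_hashes columns can be one-hot encoded (optionally after reducing i* to its lowest b bits, in the spirit of b-bit minwise hashing), and the concatenated sparse vectors fed to a linear solver such as LIBLINEAR; inner products of these encodings then approximate the min-max kernel.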
