0-Bit Consistent Weighted Sampling

We develop 0-bit consistent weighted sampling (CWS) for efficiently estimating the min-max kernel, a generalization of the resemblance kernel originally designed for binary data. Because the estimator of 0-bit CWS constitutes a positive definite kernel, the method can be naturally applied to large-scale data mining problems: if we feed the sampled data from 0-bit CWS to a highly efficient linear classifier (e.g., linear SVM), we effectively (and approximately) train a nonlinear classifier based on the min-max kernel, and the accuracy improves as the sample size increases. In this paper, we first demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data, which helps justify its use. However, because the min-max kernel is nonlinear and may be difficult to deploy in industrial applications with massive data, we propose to linearize it via 0-bit CWS, a simplification of the original CWS method. The previous remarkable work on consistent weighted sampling produces samples of the form (i*, t*), where i* records the location (and in fact also the weight) information, analogous to the samples produced by classical minwise hashing on binary data. Because t* is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. We provide a simple solution: discard t* (which we refer to as the "0-bit" scheme). Through an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply 0-bit CWS to build linear classifiers that approximate min-max kernel classifiers, as validated on a wide range of public datasets. We expect this work will generate interest among data mining practitioners who would like to efficiently exploit the nonlinear information in non-binary, nonnegative data.
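As a concrete illustration (not the paper's implementation), below is a minimal Python/NumPy sketch of the scheme, assuming the CWS construction of Ioffe (ICDM 2010): for each nonzero coordinate i with weight w_i, draw r_i, c_i ~ Gamma(2, 1) and beta_i ~ Uniform(0, 1), set t_i = floor(log(w_i)/r_i + beta_i) and a_i = c_i / exp(r_i (t_i - beta_i + 1)), and return i* = argmin_i a_i together with t* = t_{i*}. The 0-bit scheme keeps only i*, and the collision frequency of i* across two vectors estimates the min-max kernel K(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i). The function names here (zero_bit_cws, min_max_kernel) are illustrative.

    import numpy as np

    def min_max_kernel(x, y):
        """Exact min-max kernel: sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
        return np.minimum(x, y).sum() / np.maximum(x, y).sum()

    def zero_bit_cws(X, num_hashes, seed=0):
        """Illustrative 0-bit CWS. X is an (n, D) nonnegative matrix whose
        rows each have at least one positive entry. Returns an
        (n, num_hashes) array of i* values; t* is computed and discarded.
        The randomness (r, c, beta) is fixed per (hash, coordinate) so that
        samples are consistent across data vectors."""
        n, D = X.shape
        rng = np.random.default_rng(seed)
        r = rng.gamma(2.0, 1.0, size=(num_hashes, D))
        c = rng.gamma(2.0, 1.0, size=(num_hashes, D))
        beta = rng.uniform(0.0, 1.0, size=(num_hashes, D))
        out = np.empty((n, num_hashes), dtype=np.int64)
        for m in range(n):
            nz = np.flatnonzero(X[m] > 0)        # active coordinates
            logw = np.log(X[m, nz])
            for j in range(num_hashes):
                t = np.floor(logw / r[j, nz] + beta[j, nz])
                a = c[j, nz] / np.exp(r[j, nz] * (t - beta[j, nz] + 1.0))
                out[m, j] = nz[np.argmin(a)]     # keep i*, drop t* ("0-bit")
        return out

    # Sanity check: collision frequency of i* should track the exact kernel.
    rng = np.random.default_rng(1)
    X = rng.random((2, 100)) * (rng.random((2, 100)) < 0.3)  # sparse nonnegative data
    H = zero_bit_cws(X, num_hashes=2048, seed=7)
    print("exact min-max kernel: %.3f" % min_max_kernel(X[0], X[1]))
    print("0-bit CWS estimate:   %.3f" % np.mean(H[0] == H[1]))

To train a linear classifier as described above, each of the num_hashes columns can be one-hot encoded (optionally after reducing i* to its lowest b bits, in the spirit of b-bit minwise hashing), and the concatenated sparse vectors fed to a linear solver such as LIBLINEAR; inner products of these encodings then approximate the min-max kernel.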
