Large-Scale Linear NPSVM via One Permutation Hashing

Nonparallel support vector machine (NPSVM) is a novel classifier for binary classification with many theoretical and practical advantages. Efficient training of NPSVM on very high-dimensional data, however, has not yet been studied. Recently, a variety of minwise hashing algorithms, such as b-bit minwise hashing, connected bit minwise hashing, and f-fractional bit minwise hashing, have been effectively applied to obtain compact representations of the original data. However, they share serious drawbacks: generating k random permutations is time-consuming, and preprocessing the original dataset damages its structure. Fortunately, a simple and effective scheme called one permutation hashing avoids both the expensive preprocessing cost and the destruction of the original data. In this paper, we combine the one permutation hashing scheme with linear NPSVM to speed up the training and testing phases for classification on large-scale, high-dimensional datasets. Both theoretical analysis and experiments demonstrate that our algorithm offers substantial advantages in accuracy, efficiency, and energy consumption.
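To illustrate the hashing step described above, the following is a minimal sketch of one permutation hashing, assuming a sparse binary feature vector; the function name, parameters, and empty-bin marker are illustrative choices and not the paper's implementation. A single random permutation of the feature space is applied, the permuted space is split into k bins, and the smallest permuted offset of a nonzero feature in each bin is recorded. The resulting k-dimensional sketches would then be encoded (e.g., with a b-bit scheme) and fed to a linear solver such as linear NPSVM.

```python
import numpy as np

def one_permutation_hash(nonzero_indices, D, k, seed=0):
    """Illustrative one permutation hashing sketch (not the paper's exact code).

    nonzero_indices : indices of the 1s in a D-dimensional binary vector
    D               : original dimensionality
    k               : number of bins (hashed values produced)
    Returns a length-k array; each entry is the minimum permuted offset
    falling in that bin, or -1 if the bin is empty.
    """
    rng = np.random.default_rng(seed)        # same seed => same permutation for every sample
    perm = rng.permutation(D)                # a single random permutation of the feature space
    bin_size = int(np.ceil(D / k))
    sketch = np.full(k, -1, dtype=np.int64)  # -1 marks an empty bin
    for idx in nonzero_indices:
        p = perm[idx]                        # permuted position of this nonzero feature
        b = p // bin_size                    # which of the k bins it falls into
        offset = p % bin_size                # position within the bin
        if sketch[b] == -1 or offset < sketch[b]:
            sketch[b] = offset
    return sketch

# Example: two sparse binary vectors in a D = 20 feature space, k = 4 bins
x = one_permutation_hash([1, 4, 7, 13, 18], D=20, k=4, seed=42)
y = one_permutation_hash([1, 4, 9, 13, 19], D=20, k=4, seed=42)
# Matching non-empty bins give an estimate of the Jaccard resemblance between the two vectors
matches = np.sum((x == y) & (x != -1))
```

Note that, in contrast to classic minwise hashing, only one permutation is generated regardless of k, which is the source of the preprocessing savings discussed in the abstract.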
