Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing

Support vector machines (SVMs) have been promising methods for classification and regression analysis due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve huge numbers of data records. Thus it is too expensive to perform multiple scans on the entire data set, and it is also infeasible to put the data set in memory. This paper presents a method, Clustering-Based SVM (CB-SVM), that maximizes the SVM performance for very large data sets given a limited amount of resource, e.g., memory. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide an SVM with high quality samples. These samples carry statistical summaries of the data and maximize the benefit of learning. Our analyses show that the training complexity of CB-SVM is quadratically dependent on the number of support vectors, which is usually much less than that of the entire data set. Our experiments on synthetic and real-world data sets show that CB-SVM is highly scalable for very large data sets and very accurate in terms of classification.

[1]  Stefan Rüping,et al.  Incremental Learning with Support Vector Machines , 2001, ICDM.

[2]  Ian Witten,et al.  Data Mining , 2000 .

[3]  Stefan Wrobel,et al.  Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling , 2003, J. Mach. Learn. Res..

[4]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[5]  Bernhard Schölkopf,et al.  A tutorial on support vector regression , 2004, Stat. Comput..

[6]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[7]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[8]  Glenn Fung,et al.  Proximal support vector machine classifiers , 2001, KDD '01.

[9]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[10]  László Györfi,et al.  A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[11]  Yuh-Jye Lee,et al.  RSVM: Reduced Support Vector Machines , 2001, SDM.

[12]  Bernhard Schölkopf,et al.  SV Estimation of a Distribution's Support , 1999, NIPS 1999.

[13]  Gert Cauwenberghs,et al.  Incremental and Decremental Support Vector Machine Learning , 2000, NIPS.

[14]  Kevin Chen-Chuan Chang,et al.  PEBL: positive example based learning for Web page classification using SVM , 2002, KDD.

[15]  Christopher J. C. Burges,et al.  A Tutorial on Support Vector Machines for Pattern Recognition , 1998, Data Mining and Knowledge Discovery.

[16]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[17]  Samy Bengio,et al.  SVMTorch: Support Vector Machines for Large-Scale Regression Problems , 2001, J. Mach. Learn. Res..

[18]  Alexander J. Smola,et al.  Online learning with kernels , 2001, IEEE Transactions on Signal Processing.

[19]  Deepak K. Agarwal,et al.  Shrinkage estimator generalizations of Proximal Support Vector Machines , 2002, KDD.

[20]  John Platt,et al.  Fast training of svms using sequential minimal optimization , 1998 .

[21]  Huan Liu,et al.  Handling concept drifts in incremental learning with support vector machines , 1999, KDD '99.

[22]  Dan Roth,et al.  Learning Active Classifiers , 1996, ICML.

[23]  David R. Musicant,et al.  Active Support Vector Machine Classification , 2000, NIPS.

[24]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[25]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[26]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[27]  Pat Langley,et al.  Editorial: On Machine Learning , 1986, Machine Learning.

[28]  James C. French,et al.  Clustering large datasets in arbitrary metric spaces , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[29]  Osamu Watanabe,et al.  A Random Sampling Technique for Training Support Vector Machines , 2001, ALT.

[30]  Chih-Jen Lin,et al.  Training nu-Support Vector Classifiers: Theory and Algorithms , 2001, Neural Comput..

[31]  Yu-Han Chang,et al.  Not Too Hot, Not Too Cold: The Bundled-SVM is Just Right! , 2002 .

[32]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[33]  Chih-Jen Lin,et al.  Training v-Support Vector Classifiers: Theory and Algorithms , 2001, Neural Computation.

[34]  Daphne Koller,et al.  Support Vector Machine Active Learning with Application sto Text Classification , 2000, ICML.

[35]  Chih-Jen Lin,et al.  A tutorial on?-support vector machines , 2005 .

[36]  Greg Schohn,et al.  Less is More: Active Learning with Support Vector Machines , 2000, ICML.

[37]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.