Near-Boundary Data Selection for Fast Support Vector Machines

Support Vector Machines (SVMs) have become more popular than other algorithms for pattern classification. The learning phase of an SVM involves identifying the subset of informative training examples (i.e., support vectors) that make up the decision boundary. These support vectors tend to lie close to the learned boundary. In view of the nearest-neighbor property, the neighbors of a support vector are more heterogeneous in class label than those of a non-support vector. In this paper, we propose a data selection method based on a geometrical analysis of the relationship between nearest neighbors and boundary examples. On real-world problems, we evaluate the proposed data selection method in terms of generalization performance, data reduction rate, training time, and the number of support vectors. The results show that the proposed method achieves a drastic reduction in both training set size and training time without significant impairment to generalization performance compared with the standard SVM.
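The selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes "heterogeneity" is measured as the fraction of a point's k nearest neighbors that carry the opposite class label, and keeps only points whose heterogeneity exceeds a threshold; the function name, `k`, and `min_hetero` are hypothetical choices for illustration.

```python
import math

def select_near_boundary(X, y, k=2, min_hetero=0.2):
    """Return indices of points likely to lie near the decision boundary.

    A point is kept when at least a min_hetero fraction of its k nearest
    neighbors (excluding itself) belongs to a different class -- a proxy
    for the neighbor-heterogeneity property of support vectors.
    """
    def dist(a, b):
        # plain Euclidean distance between two feature vectors
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    selected = []
    for i, xi in enumerate(X):
        # rank all other points by distance to xi and take the k closest
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: dist(xi, X[j]))
        neighbors = order[:k]
        # fraction of neighbors with a different label
        hetero = sum(1 for j in neighbors if y[j] != y[i]) / k
        if hetero >= min_hetero:
            selected.append(i)
    return selected

# Two 1-D clusters: class 0 at {0,1,2}, class 1 at {3,4,5}.
# Only the points adjacent to the class border (x=2 and x=3) are kept.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
print(select_near_boundary(X, y, k=2))  # -> [2, 3]
```

The reduced index set would then be used to train an SVM on the retained examples only, which is the source of the training-time savings the abstract reports. The brute-force O(n²) neighbor search here is for clarity; a practical implementation would use a spatial index.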
