Dimensionality Reduction via Sparse Support Vector Machines

We describe a methodology for performing variable ranking and selection using support vector machines (SVMs). The method constructs a series of sparse linear SVMs to generate linear models that generalize well, then uses the subset of nonzero-weighted variables found by those linear models to produce a final nonlinear model. The method exploits the fact that a linear SVM (no kernels) with l1-norm regularization inherently performs variable selection as a side effect of minimizing the capacity of the SVM model. The distribution of the linear model weights provides a mechanism for ranking variables and interpreting their effects. Star plots are used to visualize the magnitude and variance of the weights for each variable. We illustrate the effectiveness of the methodology on synthetic data, benchmark problems, and challenging regression problems in drug design. The method can dramatically reduce the number of variables, and it outperforms SVMs trained using all attributes as well as SVMs trained on attributes selected by correlation coefficients. Visualization of the resulting models aids in understanding the role of the underlying variables.
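The pipeline described above can be sketched as follows. This is a minimal illustration using scikit-learn as a stand-in; the paper's own linear-programming sparse SVM formulation, parameter choices, and data are not reproduced here. The bootstrap loop mimics the "series of sparse linear SVMs" whose weight distribution drives the ranking, and the final RBF model stands in for the nonlinear model built on the selected variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic data in the spirit of the paper's experiments:
# only a few of the 30 variables carry signal.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, n_redundant=0,
                           random_state=0)

# Fit a series of l1-regularized linear SVMs on bootstrap samples;
# the l1 penalty drives most weights to exactly zero, so each fit
# performs variable selection as a side effect.
rng = np.random.default_rng(0)
weights = []
for _ in range(20):
    idx = rng.integers(0, len(y), size=len(y))
    clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000)
    clf.fit(X[idx], y[idx])
    weights.append(clf.coef_.ravel())
weights = np.array(weights)

# Rank variables by mean absolute weight across the bootstrap models
# (the per-variable mean and variance are what a star plot would show),
# and keep those with nonzero average weight.
mean_abs = np.abs(weights).mean(axis=0)
selected = np.flatnonzero(mean_abs > 1e-6)
print(f"kept {selected.size} of {X.shape[1]} variables")

# Final nonlinear (RBF-kernel) model trained on the reduced variable set.
final_model = SVC(kernel="rbf").fit(X[:, selected], y)
```

The threshold `1e-6`, the bagging count of 20, and `C=0.1` are illustrative choices, not values from the paper.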
