Subspace information criterion for nonquadratic regularizers: Model selection for sparse regressors

Nonquadratic regularizers, in particular the ℓ1-norm regularizer, can yield sparse solutions that generalize well. In this work we propose the generalized subspace information criterion (GSIC), which allows the generalization error to be predicted for this useful family of regularizers. We show that, under some technical assumptions, GSIC is an asymptotically unbiased estimator of the generalization error. In experiments with the ℓ1-norm regularizer, GSIC is demonstrated to perform well compared with the network information criterion (NIC) and cross-validation when the sample size is relatively large. In the small-sample case, however, GSIC tends to miss the optimal model because of its large variance. We therefore also introduce a biased version of GSIC, which achieves reliable model selection in the relevant and challenging scenario of high-dimensional data and few samples.
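As background for the setting above, the following is a minimal sketch of the ℓ1-regularized regression problem the abstract refers to, with the regularization parameter chosen by K-fold cross-validation, one of the baselines GSIC is compared against. It is not the paper's GSIC procedure, whose formula is not reproduced here; the data, the scikit-learn estimator, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Illustrative data: sparse linear model y = X w + noise, with only
# 5 of 50 coefficients nonzero (high-dimensional data, few samples).
rng = np.random.default_rng(0)
n_samples, n_features = 40, 50
X = rng.standard_normal((n_samples, n_features))
w_true = np.zeros(n_features)
w_true[:5] = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

# l1-regularized least squares (lasso). The regularization parameter
# alpha is selected by 5-fold cross-validation, the baseline model
# selection strategy mentioned in the abstract (not GSIC itself).
model = LassoCV(cv=5).fit(X, y)
print("selected alpha:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The ℓ1 penalty drives many coefficients exactly to zero, which is why this regularizer family produces the sparse, well-generalizing solutions the abstract describes; the model selection question is then how to choose the regularization strength, which is what GSIC addresses.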
