Asymptotic distribution and sparsistency for ℓ1-penalized parametric M-estimators, with applications to linear SVM and logistic regression

Since its early use in least squares regression problems, the ℓ1-penalization framework for variable selection has been employed in conjunction with a wide range of loss functions encompassing regression, classification and survival analysis. While a well-developed theory exists for ℓ1-penalized least squares estimates, few results concern the behavior of ℓ1-penalized estimates for general loss functions. In this paper, we derive two results concerning penalized estimates for a wide array of penalty and loss functions. Our first result characterizes the asymptotic distribution of penalized parametric M-estimators under mild conditions on the loss and penalty functions in the classical setting (fixed p, large n). Our second result gives necessary and sufficient generalized irrepresentability (GI) conditions for ℓ1-penalized parametric M-estimates to consistently select the components of a model (sparsistency) as well as their sign (sign consistency). In general, the GI conditions depend on the Hessian of the risk function at the true value of the unknown parameter. Under Gaussian predictors, we obtain a set of conditions under which the GI conditions can be re-expressed solely in terms of the second moment of the predictors. We apply our theory to contrast ℓ1-penalized SVM and logistic regression classifiers and find conditions under which they have the same behavior in terms of their model selection consistency (sparsistency and sign consistency). Finally, we provide simulation evidence for the theory based on these classification examples.
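For concreteness, a minimal sketch of the objects involved, in generic notation that need not match the paper's: given i.i.d. observations (x_i, y_i), a loss L, and a regularization level \lambda_n, the ℓ1-penalized parametric M-estimator is

\hat\beta_n \in \arg\min_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} L(\beta; x_i, y_i) \;+\; \lambda_n \|\beta\|_1 .

Writing R(\beta) = \mathbb{E}\, L(\beta; x, y) for the population risk, H = \nabla^2 R(\beta^*) for its Hessian at the true parameter \beta^*, and S = \mathrm{supp}(\beta^*), an irrepresentability-type condition in this setting typically takes the form

\big\| H_{S^c S}\, (H_{S S})^{-1}\, \mathrm{sign}(\beta^*_S) \big\|_\infty \le 1 ,

with the strict inequality associated with sign consistency; for the squared loss, H = \mathbb{E}[x x^\top] and this reduces to the familiar lasso irrepresentable condition on the second moment of the predictors. This display only fixes notation and illustrates the role played by the Hessian; the precise GI conditions are those stated in the paper.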

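The abstract does not specify the simulation design, so the snippet below is only a hypothetical illustration of the kind of comparison described: equicorrelated Gaussian predictors, a sparse true parameter, labels drawn from a logistic model, and the supports selected by ℓ1-penalized logistic regression and an ℓ1-penalized linear SVM. All design constants (Sigma, beta_true, C) are arbitrary choices, and scikit-learn's ℓ1-penalized LinearSVC uses the squared hinge loss, so it is only a proxy for the hinge-loss SVM studied in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Hypothetical design: equicorrelated Gaussian predictors, sparse true beta,
# labels from a logistic model. All constants are illustrative only.
rng = np.random.default_rng(0)
n, p, s = 500, 20, 3
Sigma = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta_true = np.zeros(p)
beta_true[:s] = [2.0, -1.5, 1.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)

# Lasso-style irrepresentability quantity computed from the predictor
# covariance (values <= 1 are the favourable regime for support recovery).
S, Sc = np.arange(s), np.arange(s, p)
irrep = np.max(np.abs(
    Sigma[np.ix_(Sc, S)] @ np.linalg.solve(Sigma[np.ix_(S, S)], np.sign(beta_true[S]))
))
print("irrepresentability quantity:", irrep)

# l1-penalized logistic regression vs. l1-penalized linear SVM
# (scikit-learn's l1 LinearSVC uses the squared hinge loss).
C = 0.1  # inverse regularization strength; would be tuned in practice
logit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C, max_iter=5000).fit(X, y)

def support(coef, tol=1e-6):
    return set(np.flatnonzero(np.abs(coef) > tol))

print("true support:    ", set(range(s)))
print("logistic support:", support(logit.coef_.ravel()))
print("svm support:     ", support(svm.coef_.ravel()))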