Adaptive Monotone Shrinkage for Regression

We develop an adaptive monotone shrinkage estimator for regression models with two characteristics: (i) dense coefficients with small but important effects; and (ii) an a priori ordering of the features that indicates their probable predictive importance. We capture both properties with an empirical Bayes estimator that shrinks coefficients monotonically with respect to their anticipated importance. This estimator can be computed rapidly using a version of the Pool-Adjacent-Violators (PAV) algorithm. We show that the proposed monotone shrinkage approach is competitive with the class of all Bayesian estimators that share the prior information, and we observe that the estimator also minimizes Stein's unbiased risk estimate (SURE). Along with our key result that the estimator mimics the oracle Bayes rule under the order assumption, we prove that the estimator is robust: even without the order assumption, it mimics the best performance within a large family of estimators that includes the least squares estimator, the constant-$\lambda$ ridge estimator, and the James-Stein estimator. All of the theoretical results are non-asymptotic. Simulation results and an analysis of data from a text-processing model support the theory.
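The computational claim above can be made concrete with a short sketch. Assume the orthogonalized sequence-model setting $z_i \sim N(\theta_i, \sigma^2)$, with coordinates pre-sorted from most to least anticipated importance, and consider linear shrinkage $\hat\theta_i = c_i z_i$. Then $\mathrm{SURE}(c) = \sum_i \left[(c_i - 1)^2 z_i^2 + \sigma^2 (2c_i - 1)\right]$ is a weighted quadratic in the factors $c_i$, so minimizing it under the monotonicity constraint $c_1 \ge \cdots \ge c_p$ is a weighted isotonic regression that PAV solves in linear time. The Python sketch below illustrates this reduction; the function names, the clipping of the factors to $[0,1]$, and the toy data are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def pav(y, w):
    """Pool-Adjacent-Violators: weighted least-squares fit of a
    non-decreasing sequence to y with positive weights w."""
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        # Pool the last two blocks whenever monotonicity is violated.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, n2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, n1 = means.pop(), weights.pop(), sizes.pop()
            wt = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wt)
            weights.append(wt)
            sizes.append(n1 + n2)
    # Expand pooled block means back to the original length.
    return np.repeat(means, sizes)

def monotone_shrinkage(z, sigma2):
    """Shrink coordinate estimates z (ordered from most to least
    anticipated importance) by factors that are non-increasing in
    that order, chosen to minimize SURE."""
    z = np.asarray(z, dtype=float)
    z2 = np.maximum(z ** 2, 1e-12)      # guard against zero coordinates
    target = 1.0 - sigma2 / z2          # coordinate-wise SURE minimizer
    # Completing the square shows SURE = sum_i z_i^2 (c_i - target_i)^2
    # + const, so the constrained minimizer is a weighted isotonic fit.
    # A non-increasing fit is a non-decreasing fit on the reversal.
    c = pav(target[::-1], z2[::-1])[::-1]
    c = np.clip(c, 0.0, 1.0)            # keep shrinkage factors in [0, 1]
    return c * z

# Toy usage: coefficients decay along the assumed importance ordering.
rng = np.random.default_rng(0)
theta = 2.0 / np.sqrt(np.arange(1.0, 201.0))
z = theta + rng.normal(size=theta.size)
theta_hat = monotone_shrinkage(z, sigma2=1.0)
```

Because each coordinate enters the pooling loop once and each pooled block is merged at most once, the fit runs in $O(p)$ time after sorting, which is what makes the estimator "rapidly computed" in practice.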
