Stability selection

Summary.  Estimation of structure, such as in variable selection, graphical modelling or cluster analysis, is notoriously difficult, especially for high dimensional data. We introduce stability selection. It is based on subsampling in combination with (high dimensional) selection algorithms. As such, the method is extremely general and has a very wide range of applicability. Stability selection provides finite sample control for some error rates of false discoveries and hence a transparent principle to choose a proper amount of regularization for structure estimation. Variable selection and structure estimation improve markedly for a range of selection methods if stability selection is applied. We prove for the randomized lasso that stability selection will be variable selection consistent even if the necessary conditions for consistency of the original lasso method are violated. We demonstrate stability selection for variable selection and Gaussian graphical modelling, using real and simulated data.
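To make the idea concrete, the following is a minimal sketch of subsampling combined with a selection algorithm; it is not the authors' implementation. It rests on several assumptions that the summary does not state: subsamples are drawn without replacement at half the sample size, a variable is kept when its selection frequency reaches a threshold of 0.6, and the base selector is a hypothetical stand-in (`top_q_by_correlation`, which keeps the q variables most correlated with the response) rather than the lasso or randomized lasso. All function and parameter names are illustrative.

```python
import random


def top_q_by_correlation(X, y, q):
    """Hypothetical base selector: the q columns with the largest
    absolute marginal association with y. Any selection algorithm
    returning a set of column indices could be used instead."""
    n, p = len(y), len(X[0])
    y_mean = sum(y) / n
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        cov = sum((c - c_mean) * (v - y_mean) for c, v in zip(col, y))
        var = sum((c - c_mean) ** 2 for c in col)
        # Proportional to |correlation|; the y-scale is the same for all columns.
        scores.append(abs(cov) / var ** 0.5 if var > 0 else 0.0)
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return set(ranked[:q])


def stability_selection(X, y, selector, n_subsamples=100, threshold=0.6, seed=0):
    """Run `selector` on random half-size subsamples and keep the
    variables whose selection frequency is at least `threshold`.
    Subsample size and threshold are assumptions of this sketch."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample without replacement
        Xs = [X[i] for i in idx]
        ys = [y[i] for i in idx]
        for j in selector(Xs, ys):
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    stable = [j for j in range(p) if freqs[j] >= threshold]
    return stable, freqs


# Simulated data (made up for illustration): only columns 0 and 1 carry signal.
data_rng = random.Random(1)
n_obs, n_vars = 100, 10
X = [[data_rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [3 * row[0] - 2 * row[1] + data_rng.gauss(0, 1) for row in X]

stable, freqs = stability_selection(
    X, y, selector=lambda Xs, ys: top_q_by_correlation(Xs, ys, q=2)
)
print("stable set:", stable)
print("selection frequencies:", [round(f, 2) for f in freqs])
```

The selector is passed in as a callable because the summary stresses that the method works with a range of selection algorithms; swapping in a different selector changes only that argument. The threshold here is a plain tuning constant and carries no error-rate guarantee in this sketch.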
