Guarding against Spurious Discoveries in High Dimensions

Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis because of the enormous number of possible selections. How can we know statistically whether our discoveries are better than those obtained by chance? In this paper, we define a measure of goodness of spurious fit, which quantifies how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. This measure coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of this goodness of spurious fit for generalized linear models and L1-regression. The asymptotic distribution depends on the sample size, the ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by the multiplier bootstrap and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, in which only candidate models whose goodness of fit exceeds that of a spurious fit are considered. The theory and method are illustrated by simulated examples and an application to the binary outcomes from the German Neuroblastoma Trials.
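To make the spurious-fit benchmark concrete, the sketch below simulates the simplest case, a linear model with a single selected covariate (s = 1): under a pure-noise null model it computes the maximum absolute correlation between the response and any one covariate, and estimates the corresponding null quantile by the multiplier bootstrap, in the spirit of the calibration the abstract describes. This is a minimal illustration under assumed standardized Gaussian data; the function names and the restriction to single-covariate fits are ours, not the paper's (the paper's LAMM algorithm handles general subset sizes and generalized linear models).

    import numpy as np

    def max_spurious_correlation(X, y):
        # Maximum absolute sample correlation between y and any single
        # column of X: the s = 1 case of the generalized maximum
        # spurious correlation discussed in the abstract.
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = (y - y.mean()) / y.std()
        return np.max(np.abs(Xc.T @ yc)) / len(y)

    def multiplier_bootstrap_quantile(X, y, n_boot=1000, alpha=0.05, seed=None):
        # Multiplier-bootstrap estimate of the (1 - alpha) null quantile
        # of the maximum spurious correlation: perturb the centered
        # per-observation correlation scores with i.i.d. Gaussian multipliers.
        rng = np.random.default_rng(seed)
        n = len(y)
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = (y - y.mean()) / y.std()
        Z = Xc * yc[:, None]            # n x p matrix of correlation scores
        Zc = Z - Z.mean(axis=0)         # center before perturbing
        stats = np.empty(n_boot)
        for b in range(n_boot):
            e = rng.standard_normal(n)  # Gaussian multipliers
            stats[b] = np.max(np.abs(e @ Zc)) / n
        return np.quantile(stats, 1 - alpha)

    # Pure-noise example: y is independent of all p = 2000 covariates,
    # yet the best single covariate is noticeably correlated with y.
    rng = np.random.default_rng(0)
    n, p = 100, 2000
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    print("max spurious correlation:", round(max_spurious_correlation(X, y), 3))
    print("95% null benchmark:", round(multiplier_bootstrap_quantile(X, y, seed=1), 3))

A fitted model whose correlation with the response does not clearly exceed such a bootstrap benchmark offers no evidence that the discovery is better than one obtained by chance.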
