Significance testing in non-sparse high-dimensional linear models

High-dimensional linear models are typically studied under a sparsity assumption, which states that most of the parameters are equal to zero. Under this assumption, estimation and, more recently, inference have been well studied. In practice, however, the sparsity assumption cannot be checked and, more importantly, is often violated: a large number of covariates might be expected to be associated with the response, indicating that possibly all, rather than just a few, parameters are non-zero. A natural example is genome-wide gene expression profiling, where all genes are believed to affect a common disease marker. We show that existing inferential methods are sensitive to the sparsity assumption and may, in turn, result in a severe lack of control of the Type I error. In this article, we propose a new inferential method, named CorrT, which is robust to model misspecification such as heteroscedasticity and lack of sparsity. CorrT is shown to have Type I error approaching the nominal level for any model and Type II error approaching zero for sparse and many dense models. In fact, CorrT is also shown to be optimal in a variety of frameworks: sparse, non-sparse, and hybrid models where sparse and dense signals are mixed. Numerical experiments show favorable performance of the CorrT test compared to state-of-the-art methods.
