Covariance-Insured Screening (May 2018)

Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. To identify novel biomarkers and understand biological mechanisms, it is vital to detect signals that are only weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods, which typically ignore correlation information, are likely to miss these weak signals. By incorporating inter-feature dependence, we propose a covariance-insured screening methodology to identify predictors that are jointly informative but only weakly associated with outcomes marginally. The validity of the method is examined via extensive simulations and real data studies for selecting potential genetic factors related to the onset of cancer.
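
To make the notion of a predictor that is jointly informative but marginally weak concrete, the following is a minimal population-level sketch. It is not the covariance-insured screening procedure itself, only a toy illustration of the phenomenon the abstract describes. The covariance matrix `Sigma`, the coefficient vector `beta`, and the correlation `rho` are hypothetical values chosen for the example, assuming a linear model with unit-variance predictors.

```python
import numpy as np

# Hypothetical setup: three unit-variance predictors, the first two
# correlated at rho, the third independent of both.
rho = 0.9
Sigma = np.array([[1.0, rho, 0.0],
                  [rho, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

# Assumed true coefficients in y = X @ beta + noise.
# Predictor 2 has a nonzero effect (-rho), predictor 3 has none.
beta = np.array([1.0, -rho, 0.0])

# Marginal association: cov(X_j, y) = (Sigma @ beta)_j.
marginal = Sigma @ beta

# Joint association: using the covariance structure recovers beta,
# since Sigma^{-1} cov(X, y) = beta.
joint = np.linalg.solve(Sigma, marginal)

print("marginal covariances:", np.round(marginal, 4))
print("joint coefficients:  ", np.round(joint, 4))
```

In this construction the second predictor has zero marginal covariance with the outcome, so a purely marginal ranking cannot distinguish it from the third, truly null predictor, while the joint coefficients, which use the inter-feature dependence, separate the two.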
