Covariance-Insured Screening

Modern biotechnologies have produced vast amounts of high-throughput data in which the number of predictors far exceeds the sample size. To identify novel biomarkers and understand biological mechanisms, it is vital to detect signals that are only weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods typically ignore correlation information and are therefore likely to miss weak signals. By incorporating inter-feature dependence, a covariance-insured screening approach is proposed to identify predictors that are jointly informative but marginally weakly associated with outcomes. The validity of the method is examined via extensive simulations and a real-data study for selecting potential genetic factors related to the onset of multiple myeloma.
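As a toy illustration of what "jointly informative but marginally weakly associated" can mean, the sketch below simulates two correlated predictors whose effects nearly cancel in the marginal association of one of them with the outcome. It is not the covariance-insured screening procedure itself; the simulation settings (sample size, correlation 0.8, coefficients, noise level) and the helper names `pearson` and `partial_corr` are assumptions chosen only for this example.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def partial_corr(y, x, z):
    """First-order partial correlation of y and x, adjusting for z."""
    r_yx, r_yz, r_xz = pearson(y, x), pearson(y, z), pearson(x, z)
    return (r_yx - r_yz * r_xz) / math.sqrt((1 - r_yz ** 2) * (1 - r_xz ** 2))


# Hypothetical simulation settings, chosen for illustration only.
rng = random.Random(0)
n, rho = 5000, 0.8

# x1 and x2 are standard normal with correlation rho.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x1]

# The coefficient of x2 is set to -rho so that Cov(x2, y) = rho - rho = 0
# in the population, even though x2 has a nonzero effect given x1.
y = [a - rho * b + 0.5 * rng.gauss(0, 1) for a, b in zip(x1, x2)]

marginal_x2 = pearson(y, x2)          # what a marginal screen would see
partial_x2 = partial_corr(y, x2, x1)  # association after adjusting for x1

print(f"marginal corr(y, x2):     {marginal_x2:+.3f}")
print(f"partial corr(y, x2 | x1): {partial_x2:+.3f}")
```

In this setup a screen based only on the marginal correlation would tend to discard `x2`, while the association that accounts for its dependence on `x1` is clearly nonzero, which is the kind of signal the abstract describes.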
