Markov Neighborhood Regression for High-Dimensional Inference

This paper proposes a method for constructing confidence intervals and computing $p$-values for high-dimensional linear models. The method breaks the high-dimensional inference problem into a series of low-dimensional inference problems: for each regression coefficient $\beta_i$, the confidence interval and $p$-value are computed by regressing on a subset of variables selected according to the conditional independence relations between the corresponding variable $X_i$ and the other variables. Because this subset forms a Markov neighborhood of $X_i$ in the Markov network formed by all the variables $X_1,X_2,\ldots,X_p$, the method is called Markov neighborhood regression. The method is tested on high-dimensional linear, logistic, and Cox regression, and the numerical results indicate that it significantly outperforms existing methods. Building on Markov neighborhood regression, a method for learning causal structures in high-dimensional linear models is proposed and applied to the identification of drug-sensitive genes and cancer driver genes. The idea of using conditional independence relations for dimension reduction is general and could potentially be extended to other high-dimensional or big-data problems.
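The per-coefficient idea described above can be illustrated with a minimal sketch on simulated data. Everything named here is hypothetical and not taken from the paper: the helpers `simulate_chain`, `neighborhood`, and `mnr_inference`, the chain-structured design, and in particular the neighborhood rule, which thresholds marginal correlations as a crude stand-in for estimating the Markov network. The abstract does not specify how the neighborhood is actually selected, and the full method may include further variables in each regression. Intervals and $p$-values use a normal approximation.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [float(r == c) for c in range(n)] for r, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def ols(cols, y):
    """Least squares of y on the given columns plus an intercept.

    Returns (coefficients, standard errors); index 0 is the intercept.
    """
    n = len(y)
    x = [[1.0] * n] + cols  # design matrix stored column-wise
    k = len(x)
    xtx = [[sum(u * v for u, v in zip(x[a], x[b])) for b in range(k)]
           for a in range(k)]
    xty = [sum(u * v for u, v in zip(x[a], y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    fitted = [sum(beta[a] * x[a][t] for a in range(k)) for t in range(n)]
    sigma2 = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted)) / (n - k)
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, se


def corr(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def neighborhood(cols, i, threshold=0.3):
    """ASSUMPTION: correlation thresholding stands in for Markov network learning."""
    return [j for j in range(len(cols))
            if j != i and abs(corr(cols[i], cols[j])) > threshold]


def mnr_inference(cols, y, i, threshold=0.3):
    """Low-dimensional regression of y on X_i and its estimated neighborhood."""
    nbrs = neighborhood(cols, i, threshold)
    beta, se = ols([cols[i]] + [cols[j] for j in nbrs], y)
    est, s = beta[1], se[1]  # index 1 is the coefficient of X_i
    p_value = math.erfc(abs(est / s) / math.sqrt(2))  # two-sided, normal approx.
    return {"estimate": est, "ci": (est - 1.96 * s, est + 1.96 * s),
            "p_value": p_value, "neighbors": nbrs}


def simulate_chain(n=200, p=6, rho=0.6, seed=0):
    """Toy data: X_0..X_{p-1} form a Markov chain; y depends only on X_2."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n)]]
    for _ in range(1, p):
        prev = cols[-1]
        cols.append([rho * v + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
                     for v in prev])
    y = [1.5 * cols[2][t] + rng.gauss(0, 1) for t in range(n)]
    return cols, y


cols, y = simulate_chain()
for i in range(len(cols)):
    print(i, mnr_inference(cols, y, i))
```

Each call fits a regression whose size depends on the neighborhood of $X_i$ rather than on $p$, which is the dimension-reduction step the abstract describes. In a genuinely high-dimensional setting, the thresholding rule would need to be replaced by a proper graphical-model estimator.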
