Semi-Supervised Statistical Inference for High-Dimensional Linear Regression with Blockwise Missing Data

Blockwise missing data arise frequently when integrating multisource or multimodal data in which different sources or modalities contain complementary information. In this paper, we consider a high-dimensional linear regression model with blockwise missing covariates and a partially observed response variable. Under this semi-supervised framework, we propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations and a multiple blockwise imputation procedure, and obtain its rates of convergence. Furthermore, building upon an innovative semi-supervised projected estimating equation technique that intrinsically achieves bias correction of the initial estimator, we propose nearly unbiased estimators for the individual regression coefficients that are asymptotically normally distributed under mild conditions. By carefully analyzing these debiased estimators, we construct asymptotically valid confidence intervals and statistical tests for each regression coefficient. Numerical studies and an application to the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better, and benefits more from unlabeled samples, than existing methods.
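The following is a minimal illustrative sketch, not the paper's multiple blockwise imputation or projected estimating equation procedure: it only sets up the data structure described above (blockwise missing covariates across sources and a partially observed response) and computes a standard debiased-lasso style confidence interval for one coefficient using the complete labeled cases alone. The block layout, tuning value, and labeling rate are assumptions chosen for illustration.

```python
# Toy illustration only: blockwise-missing, partially labeled data, followed by a
# complete-case lasso fit and a Zhang-Zhang / van de Geer style debiased estimate
# of one coefficient. The paper's estimator additionally uses imputed blocks and
# unlabeled samples; this sketch does not.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 900, 50
blocks = [np.arange(0, 20), np.arange(20, 35), np.arange(35, 50)]  # three "modalities"

beta = np.zeros(p)
beta[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]          # sparse true signal
X = rng.standard_normal((n, p))
y_full = X @ beta + rng.standard_normal(n)

# Blockwise missingness: source 0 observes all blocks, source 1 lacks block 3,
# source 2 lacks block 2 (complementary multisource information).
source = rng.integers(0, 3, size=n)
X_obs = X.copy()
X_obs[np.ix_(np.where(source == 1)[0], blocks[2])] = np.nan
X_obs[np.ix_(np.where(source == 2)[0], blocks[1])] = np.nan

# Semi-supervised: the response is observed only for a labeled subsample.
labeled = rng.random(n) < 0.4
y = np.where(labeled, y_full, np.nan)

# Complete labeled cases only (the naive baseline that discards information).
cc = labeled & (source == 0)
Xc, yc = X_obs[cc], y[cc]
n_cc = Xc.shape[0]

# Step 1: initial lasso fit on the complete labeled cases.
lam = 0.1
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(Xc, yc).coef_

# Step 2: debias coordinate j via a nodewise-lasso residual.
j = 0
idx = np.delete(np.arange(p), j)
gamma_hat = Lasso(alpha=lam, fit_intercept=False).fit(Xc[:, idx], Xc[:, j]).coef_
z = Xc[:, j] - Xc[:, idx] @ gamma_hat

resid = yc - Xc @ beta_hat
beta_debiased = beta_hat[j] + z @ resid / (z @ Xc[:, j])

# Step 3: plug-in standard error and 95% confidence interval.
sigma2_hat = resid @ resid / max(n_cc - np.count_nonzero(beta_hat), 1)
se = np.sqrt(sigma2_hat * (z @ z)) / abs(z @ Xc[:, j])
ci = (beta_debiased - 1.96 * se, beta_debiased + 1.96 * se)
print(f"true beta_{j} = {beta[j]:.2f}, debiased estimate = {beta_debiased:.3f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```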
