Integrating Multisource Block-Wise Missing Data in Model Selection

In multi-source data, blocks of variables from certain sources are often missing. Existing methods for handling missing data do not take the structure of block-wise missing data into account. In this paper, we propose a Multiple Block-wise Imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI draw on additional samples from groups with fewer observed variables, in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and to optimally integrate the informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to data that are missing completely at random. Numerical studies and an application to Alzheimer's Disease Neuroimaging Initiative (ADNI) data confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms.
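
The block-wise imputation idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes three hypothetical sources (`s1`, `s2`, `s3`), three missing pattern groups, and plain least-squares regression as the imputation model, and it covers only the imputation step, not the estimating equations or the variable selection. For the group that lacks `s3`, it produces one imputation from the complete group (using `s1` and `s2` as predictors) and a second one that also pools a group observing only `s1` and `s3`. The function and variable names are all made up for illustration.

```python
import numpy as np

def fit_predict(X_train, Y_train, X_new):
    """Least-squares regression with an intercept, then prediction on X_new."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, Y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def multiple_blockwise_imputations(data, cols, observed, target, missing):
    """Return one imputation of block `missing` for group `target`
    per distinct set of predictor blocks shared with a donor group."""
    # Donor groups must observe the block that the target group lacks.
    donors_all = [g for g in data if g != target and missing in observed[g]]
    # Each donor contributes the blocks it shares with the target as predictors.
    predictor_sets = {frozenset(observed[target] & observed[g]) for g in donors_all}
    imputations = {}
    for preds in predictor_sets:
        if not preds:
            continue  # no shared observed block, so nothing to regress on
        pred_cols = [c for b in sorted(preds) for c in cols[b]]
        # Pool every donor group that observes all predictor blocks.
        donors = [g for g in donors_all if preds <= observed[g]]
        X_train = np.vstack([data[g][:, pred_cols] for g in donors])
        Y_train = np.vstack([data[g][:, cols[missing]] for g in donors])
        X_new = data[target][:, pred_cols]
        imputations[tuple(sorted(preds))] = fit_predict(X_train, Y_train, X_new)
    return imputations

# Hypothetical toy data: s3 is an exact linear function of s1 (no noise).
cols = {"s1": [0], "s2": [1], "s3": [2]}
observed = {"g1": {"s1", "s2", "s3"}, "g2": {"s1", "s2"}, "g3": {"s1", "s3"}}
rng = np.random.default_rng(0)

def make_group(n, observed_blocks):
    x1 = rng.normal(size=(n, 1))
    x2 = rng.normal(size=(n, 1))
    truth = np.hstack([x1, x2, 1.0 + 2.0 * x1])
    masked = truth.copy()
    for block, idx in cols.items():
        if block not in observed_blocks:
            masked[:, idx] = np.nan  # the whole block is missing for this group
    return masked, truth

data, truths = {}, {}
for name, n in [("g1", 50), ("g2", 30), ("g3", 40)]:
    data[name], truths[name] = make_group(n, observed[name])

imps = multiple_blockwise_imputations(data, cols, observed, target="g2", missing="s3")
truth_g2 = truths["g2"]
for predictors, values in sorted(imps.items()):
    print(predictors, values.shape)
```

In this toy setting the imputation keyed by `("s1", "s2")` is fitted on the complete group only, while the one keyed by `("s1",)` is fitted on the complete group pooled with `g3`. The latter is the kind of additional imputation, built on a larger sample, that the abstract describes as the source of the efficiency gain.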