Imputed Factor Regression for High-dimensional Block-wise Missing Data

Block-wise missing data are becoming increasingly common in high-dimensional biomedical, social, psychological, and environmental studies. As a result, we need efficient dimension-reduction methods for extracting the information that matters for prediction from such data. Existing dimension-reduction methods and feature combinations are ineffective for handling block-wise missing data. We propose a factor-model imputation approach that targets block-wise missing data, and use an imputed factor regression for dimension reduction and prediction. Specifically, we first perform screening to identify the important features. Then, we impute these features based on the factor model, and build a factor regression model to predict the response variable based on the imputed features. Because of the factor structure of the model, the proposed method utilizes the essential information from all observed data. Furthermore, the method remains efficient even when the proportion of block-wise missing data is high. We show that the imputed factor regression model and its predictions are consistent under regularity conditions. We compare the proposed method with existing approaches using simulation studies, after which we apply it to data from the Alzheimer’s Dis…

Statistica Sinica: Preprint, doi:10.5705/ss.202018.0008
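The three steps described in the abstract (screening, factor-based imputation, factor regression) can be illustrated with a minimal sketch. It is not the authors' estimator: the function names, the simulated data, the marginal-correlation screening rule, and the iterative low-rank SVD fill-in used for imputation are all assumptions chosen for illustration, and the number of factors is treated as known.

```python
import numpy as np

def screen_features(X, y, keep):
    """Keep the features with the largest absolute marginal correlation
    with y, computed on the rows where each feature is observed."""
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])
        if obs.sum() > 2 and np.std(X[obs, j]) > 0:
            scores[j] = abs(np.corrcoef(X[obs, j], y[obs])[0, 1])
    return np.sort(np.argsort(scores)[::-1][:keep])

def impute_low_rank(X, n_factors, n_iter=100, tol=1e-6):
    """Fill NaNs by alternating a rank-k SVD fit with re-imputation of the
    missing cells (a generic stand-in for factor-model imputation)."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors] + mu
        updated = np.where(missing, low_rank, X)  # observed cells never change
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled

def fit_factor_regression(X_filled, y, n_factors):
    """Estimate factors by PCA on the imputed features, then regress y on them."""
    mu = X_filled.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_filled - mu, full_matrices=False)
    loadings = Vt[:n_factors].T                      # p x k
    factors = (X_filled - mu) @ loadings             # n x k
    design = np.column_stack([np.ones(len(y)), factors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return mu, loadings, coef

def predict(model, X_filled):
    mu, loadings, coef = model
    return coef[0] + ((X_filled - mu) @ loadings) @ coef[1:]

# Simulated example (hypothetical sizes): two latent factors drive 60 features.
rng = np.random.default_rng(0)
n, p, k = 200, 60, 2
F = rng.normal(size=(n, k))
Lam = rng.normal(size=(p, k))
X_full = F @ Lam.T + 0.3 * rng.normal(size=(n, p))
y = F @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Block-wise missingness: one group of subjects lacks the first block of
# features, another group lacks the second block.
X = X_full.copy()
X[:70, :20] = np.nan
X[70:140, 20:40] = np.nan

selected = screen_features(X, y, keep=30)
X_imp = impute_low_rank(X[:, selected], n_factors=k)
model = fit_factor_regression(X_imp, y, n_factors=k)
y_hat = predict(model, X_imp)
print("in-sample correlation:", np.corrcoef(y, y_hat)[0, 1])
```

Because the fill-in borrows strength from every observed block through the shared low-rank structure, subjects with a missing block still receive factor estimates. In practice the number of factors would have to be chosen from the data rather than fixed in advance.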
