A Study of High-Dimensional Data Imputation Using Additive LASSO Regression Model

With the rapid growth of computational domains such as bioinformatics, finance, engineering, biometrics, and neuroimaging, the need to analyze high-dimensional data has become pressing. Many real-world datasets contain hundreds or thousands of features. A common problem in most knowledge-based classification tasks is the quality and quantity of the data. In particular, many high-dimensional datasets contain missing or unknown attribute values, incomplete feature vectors, and uncertain or vague data, all of which have to be handled carefully. Because a large share of the values in such datasets is missing, refined multiple imputation (MI) methods are required to estimate them so that a fair and more consistent analysis can be achieved. In this paper, three imputation methods, namely mean imputation, predictive mean imputation, and imputation by additive LASSO, are employed in a cloud environment. The results show that imputation by additive LASSO is the preferred MI method.
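
To make the contrast between the simplest and the LASSO-based approach concrete, the following is a minimal sketch in plain Python. It is not the paper's additive LASSO procedure, whose details are not given here: it assumes a single column with missing entries (marked `None`), fully observed predictor columns, and an ordinary LASSO regression fitted by coordinate descent. The function names, the toy data, and the penalty `alpha=0.1` are all hypothetical and chosen only for illustration.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X, y, alpha=0.1, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + alpha*||b||_1.

    Columns are centered so the intercept can be recovered afterwards.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes j.
            rho = 0.0
            for i in range(n):
                partial = sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - partial)
            scale = sum(Xc[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho / n, alpha) / scale if scale > 0 else 0.0
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def lasso_impute(rows, target, alpha=0.1):
    """Fill None values in column `target` using a LASSO fit on the other
    columns (assumed complete in this sketch)."""
    others = [j for j in range(len(rows[0])) if j != target]
    train = [r for r in rows if r[target] is not None]
    X = [[r[j] for j in others] for r in train]
    y = [r[target] for r in train]
    intercept, beta = lasso_fit(X, y, alpha)
    filled = []
    for r in rows:
        r = list(r)
        if r[target] is None:
            r[target] = intercept + sum(b * r[j] for b, j in zip(beta, others))
        filled.append(r)
    return filled


# Hypothetical toy data: the third column equals twice the first column;
# the second column is irrelevant noise. The last row has a missing target.
rows = [
    [1.0, 0.5, 2.0],
    [2.0, -0.5, 4.0],
    [3.0, 0.5, 6.0],
    [4.0, -0.5, 8.0],
    [5.0, 0.5, 10.0],
    [6.0, -0.5, None],
]

print("mean imputation: ", mean_impute([r[2] for r in rows])[5])
print("LASSO imputation:", lasso_impute(rows, target=2)[5][2])
```

On this toy data, mean imputation fills the gap with 6.0 regardless of the predictors, whereas the LASSO fit uses the first column, gives the irrelevant second column a zero coefficient, and imputes about 11.85 (slightly shrunk from the noise-free value of 12 by the penalty). A proper multiple imputation procedure would additionally draw several imputed datasets with added noise and pool the analyses, which this sketch omits.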
