Handling high-dimensional data with missing values by modern machine learning techniques

High-dimensional data are among the most important types of big data in practice. They arise frequently in fields such as genetics, finance, and geography. Missing data in high-dimensional analyses must be handled properly to reduce nonresponse bias. We discuss modern machine learning techniques for handling missing data with high dimensionality, including penalized regression approaches, tree-based approaches, and deep learning (DL). Specifically, our proposed methods can be used to estimate general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare these methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches over the other methods in terms of balancing bias and variance.
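
For orientation, the three estimator families named above are commonly written as follows for a population mean. This is a sketch in assumed notation, not taken from the source: $y_i$ is the study variable, $x_i$ the high-dimensional covariate vector, $\delta_i$ the response indicator (1 if $y_i$ is observed), $\hat{m}(x)$ a fitted outcome model, and $\hat{\pi}(x)$ a fitted response propensity. Either fit could come from a penalized regression, a tree-based method, or a neural network. Equal weights over a sample of size $n$ are also an assumption; survey weights would replace $1/n$.

```latex
% Imputation-based estimator: keep observed y_i, impute the missing ones with \hat{m}(x_i)
\hat{\theta}_{\mathrm{I}} = \frac{1}{n} \sum_{i=1}^{n}
  \left\{ \delta_i y_i + (1 - \delta_i)\, \hat{m}(x_i) \right\}

% Propensity score estimator: weight respondents by the inverse estimated response probability
\hat{\theta}_{\mathrm{PS}} = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\delta_i}{\hat{\pi}(x_i)}\, y_i

% Doubly robust estimator: outcome-model prediction plus an inverse-probability-weighted residual correction
\hat{\theta}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n}
  \left\{ \hat{m}(x_i) + \frac{\delta_i}{\hat{\pi}(x_i)} \left( y_i - \hat{m}(x_i) \right) \right\}
```

The doubly robust form remains consistent if either $\hat{m}$ or $\hat{\pi}$ is correctly specified, which is what the name refers to. For a percentile, the same forms can be applied to the indicator $I(y_i \le t)$ to estimate the distribution function at $t$, which is then inverted.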
