Imputation Methods Outperform Missing-Indicator for Data Missing Completely at Random

Missing data is a ubiquitous problem across domains and persists in big data analytics. Approaches to handling missing data can be divided into methods that impute substitute values and methods that introduce missing-indicator variables. In this work, we demonstrate that the missing-indicator method underperforms every imputation method we evaluate. Most studies either focus on minimizing the squared error of the imputed values or use the missing-indicator method in machine learning tasks as an assumed best practice. We study how the missing-indicator method and various imputation methods differ in their effect on classifier learning performance when data are missing completely at random (MCAR). We compute classifier performance over 22 complete classification datasets of varying sample size and dimensionality from an open data repository, simulating synthetic missingness at different percentages. We compare the classifier performance obtained with mean, median, linear regression, and tree-based regression imputation against the performance obtained with the missing-indicator approach. The impact is measured for three classifiers: a tree-based ensemble classifier, a radial basis function support vector machine classifier, and a k-nearest neighbours classifier. From these experiments, we conclude that, for a classification problem with missing numerical data under MCAR, the missing-indicator method decreases performance and should therefore be dismissed as a missing-data handling approach in the MCAR scenario.
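
To make the two families of approaches concrete, the following is a minimal sketch, not the authors' implementation, of how MCAR missingness can be simulated on a complete numerical dataset and then handled either by mean imputation or by a missing-indicator encoding. The function names (`inject_mcar`, `mean_impute`, `missing_indicator`), the toy data, and the choice of 0.0 as the fill value in the indicator variant are all assumptions made for illustration; the abstract does not specify these details. The sketch uses only the Python standard library.

```python
import random
from statistics import mean


def inject_mcar(rows, rate, seed=0):
    """Return a copy of `rows` in which each value is independently
    replaced by None with probability `rate` (MCAR: the mask does not
    depend on any observed or unobserved value)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column.
    A column with no observed values falls back to 0.0 (an assumption)."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def missing_indicator(rows, fill=0.0):
    """Replace each None with a constant `fill` and append one binary flag
    per original column marking where a value was missing."""
    encoded = []
    for row in rows:
        values = [fill if v is None else v for v in row]
        flags = [1 if v is None else 0 for v in row]
        encoded.append(values + flags)
    return encoded


# Hypothetical toy dataset: 4 samples, 2 numerical features.
complete = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
incomplete = inject_mcar(complete, rate=0.25, seed=0)

imputed = mean_impute(incomplete)        # same width as the original data
flagged = missing_indicator(incomplete)  # twice the width: values + flags
```

In an experiment of the kind described above, `imputed` and `flagged` would each be passed to the same classifier and the resulting scores compared across several missingness rates. Note that the indicator encoding doubles the feature count, which is one reason its effect on a classifier can differ from that of imputation.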
