Strategy to Managing Mixed Datasets with Missing Items

This paper addresses the problem of decision making and of choosing appropriate ways to reduce the uncertainty in input information caused by absent or unavailable values in mixed data sets. Approaches to handling missing data and to evaluating their performance are discussed, and a generalized strategy for managing data with missing values is proposed. The study is based on real pregnancy-related records of 186 patients from 12 to 42 weeks of gestation. Three missing data techniques, namely complete ignoring, case deletion, and random forest (RF) missing data imputation, were applied to medical data of various types, under a missing completely at random assumption, to solve a classification task and soften the negative impact of input information uncertainty. The efficiency of each approach to dealing with missingness was evaluated. The results showed that case deletion and ignoring missing values were the least suitable for handling mixed types of missing data, and suggested RF imputation as a useful approach for imputing complex pregnancy-related data sets with missing data.
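To make the cost of case deletion concrete, the following minimal sketch masks cells completely at random and then drops every incomplete row. It is an illustration under stated assumptions, not the authors' procedure: the records are synthetic, the three columns, the 10% missingness rate, and the helper names `mask_mcar` and `case_deletion` are hypothetical, and only the row count of 186 echoes the study. RF imputation itself is not implemented here.

```python
import random


def mask_mcar(rows, rate, rng):
    """Return a copy of rows in which each cell is replaced by None
    independently with probability `rate` (missing completely at random)."""
    return [[None if rng.random() < rate else value for value in row]
            for row in rows]


def case_deletion(rows):
    """Listwise deletion: keep only rows that have no missing cell."""
    return [row for row in rows if all(value is not None for value in row)]


rng = random.Random(0)

# Synthetic mixed-type records (assumed columns, not the study's variables):
# integer gestational week, a continuous measurement, a categorical flag.
records = [
    [rng.randint(12, 42), round(rng.uniform(0.0, 1.0), 2), rng.choice(["yes", "no"])]
    for _ in range(186)
]

rate = 0.10  # assumed per-cell missingness rate
masked = mask_mcar(records, rate, rng)
complete = case_deletion(masked)

# Under MCAR with independent cells, a row with k columns stays complete
# with probability (1 - rate) ** k.
expected_retained = (1 - rate) ** 3
observed_retained = len(complete) / len(masked)

print(f"expected share of complete rows: {expected_retained:.3f}")
print(f"observed share of complete rows: {observed_retained:.3f}")
```

With three columns and 10% of cells missing, about 27% of the rows are expected to be discarded, and the loss grows quickly with the number of variables. This is one reason imputation can be preferable to deletion on wide, mixed-type records.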
