Missing data imputation using statistical and machine learning methods in a real breast cancer problem

OBJECTIVES Missing data imputation is an important task when it is crucial to use all available data rather than discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods used to predict recurrence in patients in an extensive real breast cancer data set.

MATERIALS AND METHODS Imputation methods based on statistical techniques (e.g., mean, hot-deck and multiple imputation) and on machine learning techniques (e.g., multi-layer perceptron (MLP), self-organising maps (SOM) and k-nearest neighbour (KNN)) were applied to data collected through the "El Álamo-I" project. The results were then compared with those obtained using listwise deletion (LD), which discards incomplete records instead of imputing them. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). Prediction accuracy for early cancer relapse was measured with artificial neural networks (ANNs): a separate ANN was estimated on each data set with imputed missing values.

RESULTS The imputation methods based on machine learning algorithms outperformed the statistical imputation methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model.

CONCLUSION The methods based on machine learning techniques were the most suitable for the imputation of missing values and led to a significant enhancement of prognosis accuracy compared with imputation methods based on statistical procedures.
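As a rough illustration of how three of the strategies named in the abstract differ, the following minimal Python sketch contrasts listwise deletion, mean imputation and a simple k-nearest-neighbour imputation. It is not the authors' implementation: the toy records, the three hypothetical features (age, tumour size, grade), the function names and the choice of a Euclidean distance over the features observed in both records are all assumptions made for the example.

```python
import math

def listwise_deletion(rows):
    """Keep only the records that have no missing value (None)."""
    return [r for r in rows if all(v is not None for v in r)]

def mean_impute(rows):
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def _distance(a, b):
    """Root mean squared difference over the features observed in both records."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # the two records cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the k nearest
    records in which the feature is observed (assumed variant, numeric data only)."""
    completed = []
    for i, r in enumerate(rows):
        new = list(r)
        for j, v in enumerate(r):
            if v is not None:
                continue
            # Donors come from the original data, never from already imputed values.
            donors = sorted(
                (_distance(r, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] < math.inf][:k]
            if donors:  # leave the value missing if no comparable donor exists
                new[j] = sum(val for _, val in donors) / len(donors)
        completed.append(new)
    return completed

# Hypothetical toy records: [age, tumour size, grade]; None marks a missing value.
rows = [
    [50.0, 2.0, 1.0],
    [52.0, None, 1.0],
    [70.0, 5.0, 3.0],
    [None, 5.0, 3.0],
    [51.0, 2.0, 1.0],
]

complete_cases = listwise_deletion(rows)
mean_filled = mean_impute(rows)
knn_filled = knn_impute(rows, k=2)
```

On this toy data, listwise deletion keeps three of the five records, whereas both imputation functions keep all five. Mean imputation fills every gap in a column with the same value, while the KNN variant borrows values from records that resemble the incomplete one.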
