Paper information

Missing data imputation using statistical and machine learning methods in a real breast cancer problem

OBJECTIVES: Missing data imputation is an important task when it is crucial to use all available data rather than discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods used to predict recurrence in patients in an extensive real breast cancer data set.

MATERIALS AND METHODS: Imputation methods based on statistical techniques (e.g., mean, hot-deck and multiple imputation) and on machine learning techniques (e.g., multi-layer perceptron (MLP), self-organising maps (SOM) and k-nearest neighbour (KNN)) were applied to data collected through the "El Álamo-I" project. The results were then compared with those obtained using listwise deletion (LD), which discards incomplete records instead of imputing them. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). The accuracy of early cancer relapse prediction was measured using artificial neural networks (ANNs), with a separate ANN estimated on each data set with imputed missing values.

RESULTS: The imputation methods based on machine learning algorithms outperformed the statistical imputation methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model.

CONCLUSION: The methods based on machine learning techniques were the most suited to the imputation of missing values and led to a significant improvement in prognosis accuracy compared with imputation methods based on statistical procedures.
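To make the contrast between listwise deletion, mean imputation and KNN imputation concrete, here is a minimal sketch in plain Python. It is an illustration only and not the authors' implementation: the function names, the toy `records` table, the use of `None` to mark a missing value and the distance rule (root mean squared difference over features observed in both records) are all assumptions. The other methods compared in the study (hot-deck, multiple imputation, MLP and SOM) are not shown.

```python
import math

def listwise_deletion(rows):
    """Keep only the records that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in the same column (assumes every column has at least one value)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that column over the k
    nearest donor records. Distances use only the features observed in
    both records (an assumed convention for this sketch)."""
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlap: treat as infinitely far away
        return math.sqrt(sum(shared) / len(shared))

    imputed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Donors are other records where column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [value for _, value in donors[:k]]
            new_row[j] = sum(nearest) / len(nearest)
        imputed.append(new_row)
    return imputed

# Hypothetical toy table: 5 records, 3 numeric features, 2 missing values.
records = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [5.2, 6.1, None],
    [0.9, 2.1, 2.9],
]

complete_only = listwise_deletion(records)  # drops the 2 incomplete records
mean_filled = mean_impute(records)
knn_filled = knn_impute(records, k=2)

print(len(complete_only), "complete records out of", len(records))
print("mean-imputed:", mean_filled[1][1], "KNN-imputed:", knn_filled[1][1])
```

In this toy table, listwise deletion discards two of the five records, while both imputation functions keep all five. Mean imputation fills the gap in the second record with the column average regardless of which records it resembles, whereas the KNN variant borrows the value from the two most similar records. In a study like the one described, each completed data set would then be used to train a prognosis model and the resulting AUC values compared.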
