Missing Value Imputation Using Stratified Supervised Learning for Cardiovascular Data

Legacy (and current) medical datasets are a rich source of information and knowledge. However, the use of most legacy medical datasets is beset with problems. One of the most common is missing data, often due to oversights in data capture or data entry procedures. Algorithms commonly used in data analysis often depend on a complete dataset. Missing value imputation offers a solution to this problem. This may result in the generation of synthetic data; with artificially induced missing values, simply removing the incomplete data records often produces the best classifier results. With legacy data, however, simply removing records from the original datasets can significantly reduce the data volume and often affects the class balance of the dataset. A suitable method for missing value imputation is therefore needed to produce good-quality datasets for better analysis of data resulting from clinical trials. This paper proposes a framework for missing value imputation using stratified machine learning methods. We explore machine learning techniques for predicting missing values in incomplete clinical (cardiovascular) data, with experiments comparing them against other standard methods. Two machine learning (classifier) algorithms, the fuzzy unordered rule induction algorithm and a decision tree, plus other machine learning algorithms (for comparison purposes), are trained on complete data and subsequently used to predict missing values for incomplete data. The resulting complete datasets are classified using decision tree, neural network, K-NN and K-means clustering. Classification performance is evaluated using sensitivity, specificity, accuracy, positive predictive value and negative predictive value. The results show that final classifier performance can be significantly improved for all class labels when stratification is used with the fuzzy unordered rule induction algorithm to predict missing attribute values.
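The general idea of stratified, classifier-based imputation can be illustrated with a minimal sketch. It assumes that stratification means partitioning the records by class label and training the imputation model only on complete records from the same class; the abstract does not spell this out. A simple 1-nearest-neighbour predictor stands in for the fuzzy unordered rule induction algorithm and the decision tree, and the field names (`age`, `sbp`, `smoker`, `label`) and the toy records are hypothetical.

```python
from collections import defaultdict


def predict_1nn(donors, features, target, query):
    """Stand-in classifier: copy the target value of the closest complete record."""
    def squared_distance(row):
        return sum((row[f] - query[f]) ** 2 for f in features)
    return min(donors, key=squared_distance)[target]


def stratified_impute(records, target, features, label="label"):
    """Fill missing `target` values using only complete records of the same class."""
    # Group the complete records (target present) by class label: one stratum per class.
    complete_by_class = defaultdict(list)
    for row in records:
        if row[target] is not None:
            complete_by_class[row[label]].append(row)

    completed = []
    for row in records:
        filled = dict(row)  # copy, so the input records are left unchanged
        donors = complete_by_class[row[label]]
        if row[target] is None and donors:
            filled[target] = predict_1nn(donors, features, target, row)
        completed.append(filled)
    return completed


# Hypothetical toy data: `smoker` is missing in two records.
records = [
    {"age": 60, "sbp": 150, "smoker": "yes", "label": "high"},
    {"age": 65, "sbp": 160, "smoker": "yes", "label": "high"},
    {"age": 62, "sbp": 155, "smoker": None, "label": "high"},
    {"age": 61, "sbp": 152, "smoker": "no", "label": "low"},
    {"age": 40, "sbp": 120, "smoker": "no", "label": "low"},
    {"age": 42, "sbp": 118, "smoker": None, "label": "low"},
]

completed = stratified_impute(records, target="smoker", features=["age", "sbp"])
for row in completed:
    print(row)
```

In this toy data the third record is closest overall to a "low" record whose value is "no", but because donors are restricted to the "high" stratum it is filled with "yes". This is the kind of effect stratification is meant to have, shown here only on invented data.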
