Feature projection k-NN classifier model for imbalanced and incomplete medical data

Abstract Many datasets, especially historical medical data, are incomplete. Poor data quality can significantly hamper medical diagnosis and is a bottleneck for medical support systems, which are now often used in diagnosis. Even a large amount of data can be unsuitable when it is imbalanced, missing, or corrupted. In some cases these problems can be overcome by machine learning algorithms designed for predictive modeling. The proposed approach was tested on real medical data and on several benchmark datasets from the UCI repository. From a medical point of view, liver fibrosis is difficult to treat and has a significant social and economic impact. Stages of liver fibrosis are diagnosed by clinical observation and evaluation, coupled with the METAVIR rating scale. However, these methods may be insufficient, especially for recognizing the stage of the disease. This paper describes a newly developed algorithm for non-invasive fibrosis stage recognition using machine learning methods: a classification model based on a feature projection k-NN classifier. This solution allows data characteristics to be extracted from historical data that may be incomplete and may contain imbalanced (unequally sized) sets of patients. The proposed solution is based on peripheral blood analysis without any specialized biomarkers; it can be included in medical diagnosis support systems and might be a powerful tool for estimating liver fibrosis stages.
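The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea behind k-NN on feature projections, not the authors' implementation. It assumes that each feature votes separately through its k nearest training values, that missing values (`None` or NaN) are skipped rather than imputed, and that votes are weighted by inverse class frequency to offset imbalance. The function name `fp_knn_predict`, its parameters, and the weighting scheme are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fp_knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label by per-feature k-NN voting; None/NaN mark missing values."""

    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    class_counts = Counter(y_train)
    votes = defaultdict(float)

    for j, q in enumerate(x_query):
        if missing(q):
            continue  # a missing query feature casts no votes

        # Project the training set onto feature j, dropping missing values.
        projected = [
            (abs(row[j] - q), label)
            for row, label in zip(X_train, y_train)
            if not missing(row[j])
        ]
        projected.sort(key=lambda pair: pair[0])

        # The k nearest values on this single feature vote for their classes.
        for _, label in projected[:k]:
            # Assumption: inverse class frequency weighting to counter imbalance.
            votes[label] += 1.0 / class_counts[label]

    if not votes:
        return None  # every query feature was missing
    return max(votes, key=votes.get)
```

Because each feature is handled on its own one-dimensional projection, an incomplete record still contributes through the features it does have, which is one plausible reason such a scheme suits historical data with gaps.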
