Classifying highly imbalanced ICU data

Highly imbalanced data sets are those in which the class of interest is rare. In this paper, we compare the performance of five common data mining methods: logistic regression, discriminant analysis, Classification and Regression Tree (CART) models, C5, and Support Vector Machines (SVM). The task is to predict the discharge status (alive or deceased, with “deceased” being the class of interest) of patients from an Intensive Care Unit (ICU). We evaluated each method over a range of misclassification cost ratio (MCR) values, using specificity, recall, precision, the F-measure, and confusion entropy (CEN) as performance criteria. On these criteria, C5 and SVM performed better than the other methods. At an MCR of 100, C5 had the highest recall, and SVM had the highest specificity and the lowest CEN. We also used Hand’s measure to compare the five methods; by that measure, logistic regression performed best. This article makes several contributions. We show that using MCR when analyzing imbalanced medical data significantly improves the methods’ classification performance. We also found that the F-measure and precision did not improve as the MCR was increased.
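To make the role of the MCR and of the evaluation criteria concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that a classifier outputs a probability of the "deceased" class and that the MCR is the cost of a false negative divided by the cost of a false positive, in which case the expected-cost-minimizing cutoff is 1 / (1 + MCR). The function names, the toy labels, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def cost_sensitive_threshold(mcr: float) -> float:
    """Probability cutoff minimizing expected cost when a false negative
    costs `mcr` times as much as a false positive (assumed definition)."""
    return 1.0 / (1.0 + mcr)


def classify(probs: List[float], mcr: float) -> List[int]:
    """Label a case 1 ("deceased") when its probability reaches the cutoff."""
    cutoff = cost_sensitive_threshold(mcr)
    return [1 if p >= cutoff else 0 for p in probs]


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Recall, specificity, precision, and F-measure for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical toy data: 2 deceased (1) among 10 patients.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
probs = [0.30, 0.05, 0.12, 0.02, 0.20, 0.01, 0.03, 0.15, 0.04, 0.60]

if __name__ == "__main__":
    for mcr in (1, 9):
        print(mcr, evaluate(y_true, classify(probs, mcr)))
```

On this toy data, raising the MCR lowers the cutoff, so more of the rare class is caught (higher recall) at the price of more false alarms (lower specificity and precision). This is only an illustration of the trade-off; it does not reproduce the paper's data or results.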
