Predicting disease risks from highly imbalanced data using random forest

Background
We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication, and decision support systems in healthcare.

Methods
We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Since the HCUP data are highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machines (SVM), bagging, boosting, and RF in predicting the risk of eight chronic diseases.

Results
We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging, and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.

Conclusions
By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.
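The abstract does not include an implementation, but the repeated random sub-sampling idea is straightforward to sketch. The following Python/scikit-learn example is a minimal illustration, not the authors' code: the function names, the number of sub-samples, and the input arrays (diagnosis-history features X and binary disease labels y) are hypothetical. Each sub-sample pairs all minority-class records with an equal-sized random draw from the majority class, one random forest is trained per sub-sample, and their probabilities are averaged before computing the AUC.

    # Sketch: repeated random sub-sampling + random forest ensemble (hypothetical data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def train_balanced_rf_ensemble(X_train, y_train, n_subsamples=10, n_trees=100, seed=0):
        """Train one RF per fully balanced sub-sample of the training data."""
        rng = np.random.default_rng(seed)
        pos_idx = np.where(y_train == 1)[0]   # minority class: patients with the disease
        neg_idx = np.where(y_train == 0)[0]   # majority class: patients without it
        models = []
        for i in range(n_subsamples):
            # Draw a majority-class sample of the same size as the minority class,
            # so each sub-sample is balanced.
            neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
            idx = np.concatenate([pos_idx, neg_sample])
            rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed + i)
            rf.fit(X_train[idx], y_train[idx])
            models.append(rf)
        return models

    def ensemble_predict_proba(models, X):
        """Average the positive-class probabilities across the ensemble."""
        return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

    # Usage (X_train, y_train, X_test, y_test are hypothetical NumPy arrays):
    # models = train_balanced_rf_ensemble(X_train, y_train)
    # scores = ensemble_predict_proba(models, X_test)
    # print("AUC:", roc_auc_score(y_test, scores))
    # importance = np.mean([m.feature_importances_ for m in models], axis=0)

Averaging the per-model feature_importances_ arrays, as in the last line, mirrors the variable-importance analysis mentioned in the Results, aggregated over the ensemble rather than taken from a single forest.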
