Predicting disease by using data mining based on healthcare information system

This paper applies the data mining process to predict hypertension from patient medical records that also cover eight other diseases. A sample of 9862 cases was studied, extracted from a real-world Healthcare Information System database containing 309383 medical records. We observed that the distribution of patient diseases in the medical database is imbalanced. An under-sampling technique was applied to generate the training data sets, and the data mining tool Weka was used to generate Naïve Bayesian and J-48 classifiers. In addition, an ensemble of five J-48 classifiers was created in an attempt to improve prediction performance, and rough set tools were used to reduce the ensemble based on the idea of second-order approximation. Experimental results showed a slight improvement of the ensemble approach over the single Naïve Bayesian and J-48 classifiers in accuracy, sensitivity, and F-measure.
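The under-sampling, voting, and evaluation steps named in the abstract can be sketched in plain Python. This is only an illustrative sketch, not the authors' Weka workflow: the helper names (`under_sample`, `majority_vote`, `evaluate`), the record layout, and the use of simple majority voting to combine ensemble members are all assumptions, and the rough-set reduction step is not shown.

```python
import random
from collections import Counter


def under_sample(records, label_of, rng):
    """Randomly drop records from larger classes until every class
    has as many records as the smallest class (assumed strategy)."""
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    n_min = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rng.sample(rows, n_min))
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions):
    """Combine the class labels predicted by the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, F-measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return accuracy, sensitivity, f_measure


# Hypothetical imbalanced data: 10 hypertension cases (label 1), 90 others (label 0).
records = [{"id": i, "label": 1 if i < 10 else 0} for i in range(100)]

# One balanced training set per ensemble member, each drawn with its own seed.
training_sets = [
    under_sample(records, lambda r: r["label"], random.Random(seed))
    for seed in range(5)
]
```

Each of the five balanced sets would then train one classifier, and `majority_vote` would merge their five predictions for a new record. Drawing a different random subset of the majority class per member is one common way to make the members differ; the abstract does not state how the five training sets were built.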
