Data Mining Diabetic Databases: Are Rough Sets a Useful Addition?

The publicly available Pima Indian diabetic database (PIDD) at the UC Irvine Machine Learning Lab has become a standard benchmark for testing how accurately data mining algorithms predict diabetic status from the 8 given variables. Among the 392 complete cases, guessing that all are non-diabetic gives an accuracy of 65.1%. Since 1988, many dozens of publications using various algorithms have reported accuracy rates of 66% to 81%. Rough sets have been used as a predictive data mining tool in medical areas since the late 1980s but, to our knowledge, have not been applied to the PIDD. We applied rough sets to the PIDD using the ROSETTA software, which offers many different options to choose from. With one of the methods we used, the predictive accuracy was 73.8%, with a 95% CI of (71.3%, 76.3%). Rough sets are a useful addition to the analysis of diabetic databases.
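The "guess all are non-diabetic" figure is a majority-class baseline: the accuracy obtained by always predicting the most common label. A minimal sketch of that arithmetic follows, assuming binary labels coded 0 (non-diabetic) and 1 (diabetic). The function name and the toy labels are hypothetical illustrations, not PIDD data and not part of the ROSETTA workflow.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy of always predicting the most common label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) yields [(label, count)] for the most frequent label.
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels (0 = non-diabetic, 1 = diabetic); NOT the PIDD data.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"majority-class baseline: {baseline:.1%}")
```

Any classifier evaluated on the same cases has to exceed this baseline to show that it has learned something from the predictor variables.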
