Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods

OBJECTIVE Diabetic nephropathy is kidney damage caused by diabetes mellitus. It is a common complication and a leading cause of death in people with diabetes. However, the decline in kidney function varies considerably between patients, and the determinants of diabetic nephropathy have not been clearly identified. It is therefore very difficult to predict the onset of diabetic nephropathy accurately with simple statistical approaches such as the t-test or the chi-square test. To predict the onset accurately, we applied various machine learning techniques, including support vector machine (SVM) classification and feature selection methods, to an irregular and unbalanced diabetes dataset. A further objective was to visualize the risk factors so that physicians receive intuitive information on each patient's clinical pattern.
METHODS AND MATERIALS We collected medical data from 292 patients with diabetes and preprocessed the irregular data to extract 184 features. To predict the onset of diabetic nephropathy, we compared several classification methods, including logistic regression, SVM, and SVM with cost-sensitive learning. We also applied several feature selection methods to remove redundant features and improve classification performance. For risk factor analysis with SVM classifiers, we developed a new visualization system based on a nomogram approach.
RESULTS Linear SVM classifiers combined with wrapper or embedded feature selection methods gave the best results. Among the 184 features, the classifiers selected the same 39 features and achieved an area under the receiver operating characteristic curve (AUC) of 0.969. The visualization tool presented the effect of each feature on the decision as graphical output.
CONCLUSIONS The proposed method predicts the onset of diabetic nephropathy about 2-3 months before the actual diagnosis, with high prediction performance, from an irregular and unbalanced dataset; statistical methods such as the t-test and logistic regression could not achieve this. In addition, the visualization system gives physicians intuitive information for risk factor analysis. Physicians can therefore benefit from an automatic early warning for each patient and from visualization of the risk factors, both of which facilitate the planning of effective and appropriate treatment strategies.
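
The abstract does not give implementation details, so the following is only a minimal sketch of three quantities it relies on, under stated assumptions. It shows class weights set inversely proportional to class frequency (one common way to choose misclassification costs for cost-sensitive learning on unbalanced data, not necessarily the scheme used in the study), the AUC computed as a rank statistic, and the per-feature terms of a linear decision function, which are the kind of quantity a nomogram displays. All function names and the toy data are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: inverse-frequency weighting; the study's actual cost
    settings are not stated in the abstract.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half (Mann-Whitney U divided by n_pos * n_neg).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def linear_contributions(weights, features):
    """Per-feature terms w_j * x_j of a linear decision function."""
    return [w * x for w, x in zip(weights, features)]


# Toy, made-up data: 2 positive cases among 8 patients (unbalanced).
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.6, 0.9]

print(balanced_class_weights(toy_labels))
print(auc(toy_scores, toy_labels))
print(linear_contributions([2.0, -1.0], [0.5, 3.0]))
```

In this sketch the minority class receives the larger weight, so errors on the rarer positive cases cost more during training. The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients.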
