Cascading K-means Clustering and K-Nearest Neighbor Classifier for Categorization of Diabetic Patients

Abstract— Medical data mining is the process of extracting hidden patterns from medical data. This paper presents a hybrid model for classifying the Pima Indian diabetic database (PIDD). The model consists of three stages. In the first stage, K-means clustering is used to identify and eliminate incorrectly classified instances. In the second stage, a genetic algorithm (GA) and correlation-based feature selection (CFS) are used in a cascaded fashion for relevant feature extraction: the GA performs a global search over the attributes, and CFS evaluates the fitness of each candidate subset. Finally, in the third stage, fine-tuned classification is performed using K-nearest neighbor (KNN), which takes as inputs the correctly clustered instances from the first stage and the feature subset identified in the second stage. Experimental results indicate that cascading K-means clustering and KNN, together with the feature subset identified by GA_CFS, enhances the classification accuracy of KNN. The proposed model obtained a classification accuracy of 96.68% on the diabetic dataset.
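The three stages described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the implementation used in the paper: the data are synthetic (not PIDD), Pearson correlation stands in for the correlation measure inside the CFS merit, feature selection is run on the filtered instances, the GA settings (population size, generations, mutation rate) and the hold-out split are arbitrary, and all function names (`kmeans_filter`, `cfs_merit`, `ga_select`, `knn_predict`, `run_pipeline`, `make_synthetic`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(n=200, n_features=8):
    """Synthetic stand-in data (NOT the PIDD): only the first 3 features carry class signal."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    X[:, :3] += 3.0 * y[:, None]
    return X, y

def kmeans_filter(X, y, k=2, iters=50):
    """Stage 1: keep instances whose label matches the majority class of their cluster."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    keep = np.zeros(len(X), dtype=bool)
    for j in range(k):
        members = assign == j
        if members.any():
            majority = np.bincount(y[members]).argmax()
            keep |= members & (y == majority)
    return keep

def cfs_merit(X, y, mask):
    """CFS merit: k * r_cf / sqrt(k + k(k-1) * r_ff), using |Pearson r| (an assumption)."""
    idx = np.flatnonzero(mask)
    k = len(idx)
    if k == 0:
        return 0.0
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in idx])
    r_ff = 0.0
    if k > 1:
        corr = np.abs(np.corrcoef(X[:, idx], rowvar=False))
        r_ff = (corr.sum() - k) / (k * (k - 1))  # mean off-diagonal correlation
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1):
    """Stage 2: GA over feature bitmasks, with the CFS merit as the fitness function."""
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))
    for _ in range(generations):
        fit = np.array([cfs_merit(X, y, m) for m in pop])
        a, b = rng.integers(0, pop_size, size=(2, pop_size))  # binary tournaments
        parents = np.where((fit[a] >= fit[b])[:, None], pop[a], pop[b])
        cut = rng.integers(1, n, size=pop_size)               # one-point crossover
        mates = np.roll(parents, 1, axis=0)
        children = np.where(np.arange(n)[None, :] < cut[:, None], parents, mates)
        flips = rng.random(children.shape) < p_mut            # bit-flip mutation
        children = np.where(flips, 1 - children, children)
        children[0] = pop[fit.argmax()]                       # elitism: keep the best mask
        pop = children
    fit = np.array([cfs_merit(X, y, m) for m in pop])
    return pop[fit.argmax()].astype(bool)

def knn_predict(X_train, y_train, X_test, k=5):
    """Stage 3: plain KNN with Euclidean distance and majority vote."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

def run_pipeline(X, y, k_neighbors=5):
    keep = kmeans_filter(X, y)                 # stage 1: drop mismatched instances
    Xc, yc = X[keep], y[keep]
    mask = ga_select(Xc, yc)                   # stage 2: choose a feature subset
    Xs = Xc[:, mask]
    test = np.arange(len(Xs)) % 4 == 0         # simple hold-out split (an assumption)
    pred = knn_predict(Xs[~test], yc[~test], Xs[test], k_neighbors)
    return keep, mask, float((pred == yc[test]).mean())

X, y = make_synthetic()
keep, mask, acc = run_pipeline(X, y)
print(keep.sum(), mask.astype(int), round(acc, 3))
```

Because the filtering step removes instances that disagree with their cluster's majority class before KNN is trained and evaluated, accuracy measured this way is on the retained instances only, which is worth keeping in mind when comparing against classifiers evaluated on the full dataset.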