Prediction of diseases by cascading clustering and classification

Disease diagnosis is one application area in which data mining techniques help extract knowledge from medical databases. Recently, researchers have investigated cascading more than one technique and have reported improved diagnostic results. This paper proposes a hybrid model that uses K-means as a preprocessing algorithm. The proposed model is developed in four stages. In the first stage, datasets selected from the UCI repository are cleaned by deleting all instances with missing values. In the second stage, the Best First search algorithm and correlation-based feature selection (CFS) are used in a cascaded fashion to select relevant features. In the third stage, the resulting dataset (a binary-class dataset) is clustered into two segments using K-means, and incorrectly clustered samples are eliminated to obtain the final samples. Finally, the correctly clustered samples from the previous stage are used to train 12 different classifiers to build the final classifier model, using stratified 10-fold cross-validation. Experimental results show that cascading K-means clustering and classification, with CFS and Best First as the feature selection method, achieves an average classification accuracy of 95% or higher on 5 different medical datasets with all 12 classifiers.
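
The third stage (clustering a binary-class dataset into two segments and discarding incorrectly clustered samples) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain Lloyd's K-means with k = 2, and it assumes that a sample counts as "incorrectly clustered" when its class label differs from the majority label of its cluster; the abstract does not state the exact rule. The function names `kmeans` and `filter_by_cluster` and the toy data are hypothetical.

```python
import random
from collections import Counter


def kmeans(points, k=2, iters=100, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing, so the run has converged
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def filter_by_cluster(points, labels, k=2, seed=0):
    """Keep only samples whose label matches their cluster's majority label.

    Assumption: the majority-vote mapping from clusters to classes stands in
    for whatever cluster-to-class rule the original study used.
    """
    assign = kmeans(points, k=k, seed=seed)
    majority = {}
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    keep = [i for i, (l, a) in enumerate(zip(labels, assign)) if majority[a] == l]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy binary-class data (hypothetical): the last sample carries label 1
# but lies among the label-0 samples.
X = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
     (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
     (1.1, 0.9)]
y = [0, 0, 0, 1, 1, 1, 1]

X_kept, y_kept = filter_by_cluster(X, y)
```

In this toy example the last sample falls in the cluster dominated by class 0 and is removed, leaving six samples. In the pipeline described above, the retained samples would then be passed to the classifiers and evaluated with stratified 10-fold cross-validation.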