Impact of K-Means on the Performance of Classifiers for Labeled Data

This study proposes a novel framework for data mining in clinical decision making. The framework addresses the problems of assessing and utilizing data mining models in the medical domain, and it consists of three stages. The first stage preprocesses the data to improve its quality. The second stage applies the k-means clustering algorithm to partition the data into k clusters (in our case k = 2, i.e. cluster0 / no and cluster1 / yes) in order to validate the class labels associated with the data. After clustering, the class label of each instance is compared with the label generated by the clustering algorithm; if the two labels agree, the instance is assumed to be correctly classified. Instances whose labels disagree are considered misclassified and are removed before further processing. In the third stage, a support vector machine (SVM) classifier is applied, and the classification model is validated using k-fold cross-validation. The performance of the SVM classifier is also compared with that of a Naive Bayes classifier; in our case the SVM classifier outperforms the Naive Bayes classifier. To validate the proposed framework, experiments were carried out on two benchmark datasets, the Pima Indians diabetes dataset and the Wisconsin breast cancer dataset (WBCD). Both datasets were obtained from the University of California at Irvine (UCI) machine learning repository. On both datasets the proposed framework obtained a classification accuracy better than that reported in the literature for other classification algorithms applied to the same data. The performance of the proposed framework was also evaluated using the sensitivity and specificity measures.
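The label-validation stage and the two evaluation measures can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`kmeans`, `filter_by_cluster_agreement`, `sensitivity_specificity`) are hypothetical, the toy data are invented for illustration, and the rule that maps each cluster to a class (majority vote of the class labels inside the cluster) is an assumption, since the abstract does not state how cluster0 and cluster1 are aligned with "no" and "yes". The classification stage (SVM or Naive Bayes with k-fold cross-validation) is left out; any classifier could be trained on the instances that survive the filter.

```python
import math
import random
from collections import Counter


def kmeans(points, k=2, iters=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def filter_by_cluster_agreement(points, labels, k=2, seed=0):
    """Return indices of instances whose class label agrees with the
    majority class label of their cluster (assumed alignment rule)."""
    assign = kmeans(points, k=k, seed=seed)
    majority = {}
    for c in set(assign):
        counts = Counter(l for l, a in zip(labels, assign) if a == c)
        majority[c] = counts.most_common(1)[0][0]
    return [i for i, (l, a) in enumerate(zip(labels, assign)) if majority[a] == l]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy data (invented): two well-separated groups, one suspicious label in each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (4.9, 5.05)]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

keep = filter_by_cluster_agreement(points, labels)
clean_points = [points[i] for i in keep]
clean_labels = [labels[i] for i in keep]
print(keep)
```

In this toy run the two instances whose labels contradict their cluster are dropped, and `clean_points` / `clean_labels` would then be passed to the classifier. Note that such a filter removes exactly the instances that are hardest to separate, so accuracy measured on the filtered data is not directly comparable with accuracy measured on the full dataset.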
