IMPROVING THE PERFORMANCE OF K-NEAREST NEIGHBOR ALGORITHM FOR THE CLASSIFICATION OF DIABETES DATASET WITH MISSING VALUES

In today’s world, people are affected by many diseases that cannot be completely cured. Diabetes is one such disease and is now a large and growing health problem. It raises the risk of heart attack, kidney failure and renal disease. Data mining techniques have been widely applied to extract knowledge from medical databases. In this paper, we evaluated the performance of the k-nearest neighbor (kNN) algorithm for the classification of diabetes data. We considered data imputation, scaling and normalization techniques to improve the accuracy of the classifier on diabetes data, which may contain many missing values. We chose kNN because it is very simple and faster than most of the more complex classification algorithms. To measure performance, we used accuracy and error rate as the metrics. We found that data imputation by itself does not lead to higher accuracy; instead, it gives a correct accuracy figure when values are missing, while imputation combined with a suitable data preprocessing method increases the accuracy.
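The pipeline described above (imputation, scaling, kNN classification, then accuracy and error rate) can be illustrated with a minimal sketch. The abstract does not state which imputation or scaling method was used, so mean imputation and min-max scaling are assumptions here, as are all function names, the choice k = 3, and the tiny made-up two-feature dataset; none of this is taken from the paper's experiments. For brevity the sketch preprocesses training and test rows together, which a careful evaluation would avoid.

```python
import math
from collections import Counter

def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def min_max_scale(rows):
    """Rescale every column to [0, 1] so no feature dominates the distance."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j, v in enumerate(row)] for row in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: [glucose, BMI], with None marking a missing value.
features = [
    [85, 22.0], [90, None], [95, 24.0],      # labelled 0 (non-diabetic)
    [160, 33.0], [170, 35.0], [None, 36.0],  # labelled 1 (diabetic)
    [88, 23.0], [165, 34.0],                 # held out for testing
]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

prepared = min_max_scale(impute_mean(features))
train_x, train_y = prepared[:6], labels[:6]
test_x, test_y = prepared[6:], labels[6:]

predictions = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
acc = accuracy(test_y, predictions)
error_rate = 1.0 - acc  # error rate is the complement of accuracy
print(predictions, acc, error_rate)
```

Scaling matters for kNN because the raw glucose values span a much wider range than BMI and would otherwise dominate the Euclidean distance.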
