EFFECTIVENESS OF K-MEANS CLUSTERING TO DISTRIBUTE TRAINING DATA AND TESTING DATA ON K-NEAREST NEIGHBOR CLASSIFICATION

One of the constraints in classification is how to divide a dataset into two parts, training and testing, so that each part represents the full data distribution. The most commonly used technique is K-Fold Cross Validation, which divides the data into several folds that alternately serve as training and testing data. Another common option in data mining research is a fixed percentage split (70% training and 30% testing). K-Means is a clustering algorithm that can maximize the effectiveness of data distribution for classification. Experiments that used K-Means Clustering to distribute data for K-Nearest Neighbor (K-NN) classification, validated with a Confusion Matrix, reached a highest accuracy of 93.4%, higher than the K-Fold Cross Validation technique in every experiment, using both Education Management Information System (EMIS) data and random data. Distributing data by cluster makes each split representative of its group's members and increases the accuracy of the classification algorithm, even though the experiments applied only a 70% training and 30% testing split within each group.
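The procedure the abstract describes can be sketched as: cluster the records with K-Means, split each cluster 70%/30% into training and testing sets, then classify with K-NN and validate with a confusion matrix. The following is a minimal sketch using scikit-learn; the EMIS dataset is not available here, so synthetic data stands in for it, and the cluster count `k=3` and neighbor count `n_neighbors=5` are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import make_classification

# Synthetic stand-in for the EMIS data used in the paper.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Step 1: group the records with K-Means so each split can represent
# every cluster (k=3 is an arbitrary choice for illustration).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: take 70% of each cluster for training, 30% for testing,
# so both splits contain members of every group.
train_idx, test_idx = [], []
rng = np.random.default_rng(0)
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    rng.shuffle(members)
    cut = int(0.7 * len(members))
    train_idx.extend(members[:cut])
    test_idx.extend(members[cut:])

# Step 3: classify with K-NN and validate with a confusion matrix.
knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
pred = knn.predict(X[test_idx])
print(confusion_matrix(y[test_idx], pred))
print("accuracy:", accuracy_score(y[test_idx], pred))
```

The key design point is step 2: unlike a single random 70/30 split, every cluster contributes proportionally to both the training and testing sets, which is what makes the split representative of the data distribution.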
