EFFECTIVENESS OF K-MEANS CLUSTERING TO DISTRIBUTE TRAINING DATA AND TESTING DATA ON K-NEAREST NEIGHBOR CLASSIFICATION

One of the constraints in classification is how to divide a dataset into two parts, training and testing, so that each part represents the full data distribution. The most commonly used technique is K-Fold Cross Validation, which divides the data into several folds that alternately serve as training and testing data. Another common option in data mining research is a fixed percentage split (70% training and 30% testing). K-Means is a clustering algorithm that can maximize the effectiveness of data distribution for classification. Experiments that used K-Means Clustering to distribute data for K-Nearest Neighbor (K-NN) classification, validated with a Confusion Matrix, reached a highest accuracy of 93.4%, higher than the K-Fold Cross Validation technique in every experiment, using both Education Management Information System (EMIS) data and random data. Distributing data by cluster makes each split representative of its group's members and increases the accuracy of the classification algorithm, even though the experiments applied only a 70% training and 30% testing split within each group.
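The procedure the abstract describes can be sketched as: cluster the records with K-Means, split each cluster 70%/30% into training and testing sets, then classify with K-NN and validate with a confusion matrix. The following is a minimal sketch using scikit-learn; the EMIS dataset is not available here, so synthetic data stands in for it, and the cluster count `k=3` and neighbor count `n_neighbors=5` are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import make_classification

# Synthetic stand-in for the EMIS data used in the paper.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Step 1: group the records with K-Means so each split can represent
# every cluster (k=3 is an arbitrary choice for illustration).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: take 70% of each cluster for training, 30% for testing,
# so both splits contain members of every group.
train_idx, test_idx = [], []
rng = np.random.default_rng(0)
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    rng.shuffle(members)
    cut = int(0.7 * len(members))
    train_idx.extend(members[:cut])
    test_idx.extend(members[cut:])

# Step 3: classify with K-NN and validate with a confusion matrix.
knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
pred = knn.predict(X[test_idx])
print(confusion_matrix(y[test_idx], pred))
print("accuracy:", accuracy_score(y[test_idx], pred))
```

The key design point is step 2: unlike a single random 70/30 split, every cluster contributes proportionally to both the training and testing sets, which is what makes the split representative of the data distribution.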
