APPLYING THE BAGGING TECHNIQUE TO CLASSIFICATION ALGORITHMS TO ADDRESS CLASS IMBALANCE IN MEDICAL DATASETS

ABSTRACT – Class imbalance has been reported to severely hinder the classification performance of many standard learning algorithms and has attracted a great deal of attention from researchers in different fields. A number of methods have therefore been proposed to address it, including sampling methods, cost-sensitive learning methods, and ensemble methods based on bagging and boosting. Many medical datasets have two (binomial) classes and exhibit class imbalance, which reduces classification accuracy. This research proposes combining the bagging technique with classification algorithms to improve accuracy on medical datasets, using bagging to mitigate the class imbalance problem. The proposed method is applied to three classifier algorithms: naive Bayes, decision tree, and k-nearest neighbor. Five medical datasets from the UCI Machine Learning Repository are used: breast-cancer, liver-disorder, heart-disease, pima-diabetes, and vertebral-column. The results indicate that the proposed method yields a significant improvement for two of the classification algorithms, decision tree (t-test p-value 0.0184) and k-nearest neighbor (t-test p-value 0.0292), but not for naive Bayes (t-test p-value 0.9236). After applying the bagging technique to the five medical datasets, naive Bayes achieves the highest accuracy on breast-cancer (96.14%, AUC 0.984), heart-disease (84.44%, AUC 0.911), and pima-diabetes (74.73%, AUC 0.806), while k-nearest neighbor performs best on liver-disorder (62.03%, AUC 0.632) and vertebral-column (82.26%, AUC 0.867). Keywords: ensemble technique, bagging, imbalanced class, medical dataset.
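The bagging procedure the abstract describes, training each classifier on bootstrap resamples of the data and combining predictions by majority vote, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a single-feature decision stump as the base learner (the paper's experiments use naive Bayes, decision tree, and k-nearest neighbor), and all names here (`Stump`, `bagging_fit`, `bagging_predict`) are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) examples with replacement (one bootstrap replicate)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

class Stump:
    """One-feature threshold classifier used here as a simple base learner."""
    def fit(self, X, y):
        best = None
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for pos in (0, 1):  # which side of the threshold is class `pos`
                    pred = [pos if x[f] <= t else 1 - pos for x in X]
                    acc = sum(p == yi for p, yi in zip(pred, y))
                    if best is None or acc > best[0]:
                        best = (acc, f, t, pos)
        _, self.f, self.t, self.pos = best
        return self

    def predict(self, X):
        return [self.pos if x[self.f] <= self.t else 1 - self.pos for x in X]

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train one base learner per bootstrap replicate of (X, y)."""
    rng = random.Random(seed)
    return [Stump().fit(*bootstrap_sample(X, y, rng))
            for _ in range(n_estimators)]

def bagging_predict(models, X):
    """Aggregate by majority vote across the bootstrap-trained models."""
    votes = [m.predict(X) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]
```

The resampling step is why bagging can help with imbalance: each replicate sees a different mix of majority- and minority-class examples, so the vote averages out base learners that overfit the majority class.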
