Data-Level Approaches for Handling Class Imbalance in Software Defect Prediction

Software metrics datasets are generally imbalanced. This can degrade the performance of software defect prediction models, because they tend to predict the majority class. In general, class imbalance can be handled with two approaches: the data level and the algorithm level. Data-level approaches aim to improve the class balance, whereas algorithm-level approaches aim to modify the algorithm or combine classifiers (ensemble) so that they are more conducive to the minority class. This study proposes a data-level approach based on resampling, namely random oversampling (ROS) and random undersampling (RUS), and on synthesizing samples with the FSMOTE algorithm. The classifier used is Naïve Bayes. The results show that FSMOTE+NB is the best data-level model for software defect prediction: its sensitivity and G-Mean increase significantly, whereas those of the ROS+NB and RUS+NB models do not increase significantly.
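To make the two resampling schemes and the two reported measures concrete, the following is a minimal sketch rather than the study's implementation. It assumes the usual definitions: ROS duplicates randomly chosen minority-class rows, RUS discards randomly chosen majority-class rows, sensitivity is the true-positive rate, and G-Mean is the geometric mean of sensitivity and specificity. The function names, the `seed` parameter, and the convention that label `1` marks a defective module are illustrative assumptions. FSMOTE and the Naïve Bayes classifier are not sketched here.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS sketch: duplicate random rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS sketch: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):  # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def sensitivity_and_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, G-Mean) for a binary problem; `positive` is the minority label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Toy data (hypothetical): 2 defective modules (label 1) and 8 clean ones (label 0).
    X = [[i] for i in range(10)]
    y = [1, 1] + [0] * 8
    print(Counter(random_oversample(X, y)[1]))
    print(Counter(random_undersample(X, y)[1]))
    print(sensitivity_and_gmean([1, 1, 0, 0], [1, 0, 0, 0]))
```

In an evaluation of this kind, resampling would normally be applied only to the training split, so that the test data keeps its original class distribution.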
