Detecting Outliers in Terms of Errors in Embedded Software Development Projects Using Imbalanced Data Classification

This study examines the effect of undersampling on the detection of outliers, defined in terms of the number of errors, in embedded software development projects. Our broader research aims to estimate the number of errors and the amount of effort in such projects. Because outliers can adversely affect this estimation, many estimation models exclude them. In practice, however, such outliers can be identified only after the projects have been completed; therefore, they should not be excluded when constructing models and estimating errors or effort. We have also attempted to detect outliers, but the classification accuracy was not acceptable because the outliers were too few. This problem is known as data imbalance. To address it, we explore rebalancing methods based on k-means cluster-based undersampling. These methods aim to increase the proportion of outliers that are correctly identified while keeping the other classification performance metrics high. Evaluation experiments show that the proposed methods improve the accuracy of detecting outliers; however, they also classify too many samples as outliers.
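As an illustration of the rebalancing idea, the following is a minimal sketch of one common form of k-means cluster-based undersampling. It is not necessarily the variant evaluated in the study. It assumes that the majority (non-outlier) class is clustered into as many groups as there are outliers and that the real sample nearest each centroid is kept. The function names, the label convention (0 for normal projects, 1 for outliers), and the parameters are hypothetical. The helper `recall_precision` computes the two quantities the abstract refers to: the proportion of outliers correctly identified (recall) and the share of predicted outliers that are true outliers (precision), which falls when too many samples are classified as outliers.

```python
import numpy as np


def kmeans_centroids(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of X; returns the k centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every sample to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


def undersample_majority(X, y, majority_label=0, seed=0):
    """Reduce the majority class to roughly the size of the minority class.

    Assumption: one cluster per minority sample, and each cluster is
    represented by the real majority sample closest to its centroid.
    """
    maj_mask = y == majority_label
    X_maj, X_min, y_min = X[maj_mask], X[~maj_mask], y[~maj_mask]
    k = min(len(X_min), len(X_maj))
    centroids = kmeans_centroids(X_maj, k, seed=seed)
    dists = np.linalg.norm(X_maj[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest majority sample for each centroid (duplicates dropped).
    keep = np.unique(dists.argmin(axis=0))
    X_bal = np.vstack([X_maj[keep], X_min])
    y_bal = np.concatenate([np.full(len(keep), majority_label), y_min])
    return X_bal, y_bal


def recall_precision(y_true, y_pred, positive=1):
    """Recall and precision for the outlier (positive) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return float(recall), float(precision)
```

A classifier would then be trained on the rebalanced `X_bal, y_bal` and evaluated on the original, imbalanced data. High recall combined with low precision corresponds to the behaviour reported above: more outliers are found, but too many normal projects are flagged as outliers.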
