Synthetic Minority Over-sampling Technique (SMOTE) for Predicting Software Build Outcomes

In this research we take a data stream approach to mining and construct decision tree models that predict software build outcomes from software metrics derived from the source code used in the software construction process. The rationale for the data stream approach is to track how the prediction model evolves over time as builds are incrementally constructed from previous versions, either to remedy errors or to enhance functionality. Because the volume of data available for mining from the software repository we used was limited, we synthesized new data instances by applying the SMOTE oversampling algorithm. The results indicate that only a small number of the available metrics are significant for predicting software build outcomes. Classification accuracy improves steadily after approximately 900 build instances have been fed to the classifier. At the end of the data streaming process, classification accuracies of 80% were achieved, though some bias arises from the distribution of data across the two classes over time.
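To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic instance is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. It is a simplified illustration, not the configuration used in this study. The function name `smote`, its parameters, the random choice of base samples, and the two-column metric vectors are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical metric vectors for the minority class (illustrative values only).
minority_builds = [[10.0, 2.0], [12.0, 3.0], [11.0, 2.5], [13.0, 3.5]]
new_rows = smote(minority_builds, n_synthetic=6, k=2)
```

Because every synthetic row is a convex combination of two real minority rows, it always falls within the range of the observed metric values. This is also why oversampling of this kind can shift the class distribution seen by the classifier over time.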
