Data stream mining for predicting software build outcomes using source code metrics

Context: Software development projects use a wide range of tools to produce a software artifact. Software repositories such as source control systems have become a focus of emerging research because they are a rich source of information about software development projects. Mining such repositories is increasingly common as a way to gain a deeper understanding of the development process.

Objective: This paper explores the concept of representing a software development project as a process that generates a data stream. It also describes the extraction of metrics from the Jazz repository and the application of data stream mining techniques to identify metrics that are useful for predicting build success or failure.

Method: This research is a systematic study that applies the Hoeffding Tree classification method in conjunction with the Adaptive Sliding Window (ADWIN) method for detecting concept drift, using the Massive Online Analysis (MOA) tool.

Results: The results indicate that only a relatively small number of the available measures have any significance for predicting the outcome of a build over time. These significant measures are identified and the implications of the results are discussed, particularly the relative difficulty of predicting failed builds. The Hoeffding Tree approach is shown to produce a more stable and robust model than traditional data mining approaches.

Conclusion: An overall prediction accuracy of 75% has been achieved using the Hoeffding Tree classification method. Despite this high overall accuracy, failure is harder to predict than success. The emergence of a stable classification tree is limited by the lack of data, but overall the approach shows promise for informing software development activities in order to minimize the chance of failure.
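For readers unfamiliar with the Hoeffding Tree method named above, the following is a minimal sketch of the Hoeffding bound that such trees generally use to decide when enough stream instances have been seen to split a node. It is not taken from the paper or from MOA: the function names, the `tie_threshold` parameter, and the numeric values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    With probability 1 - delta, the true mean of a variable with range R
    lies within epsilon of the mean observed over n samples.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf has seen enough instances to split.

    Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon is so small that the two are effectively tied.
    The tie_threshold default is a hypothetical value for illustration.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # The bound tightens as more build records arrive in the stream.
    for n in (200, 2000, 20000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because the bound shrinks only as the number of observed instances grows, a short stream leaves few nodes with enough evidence to split, which is consistent with the abstract's remark that the emergence of a stable tree is limited by the lack of data.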
