On software engineering repositories and their open problems

In the last decade, a large number of software repositories have been created for different purposes. In this paper we present a survey of the publicly available repositories and classify the most common ones as well as discussing the problems faced by researchers when applying machine learning or statistical techniques to them.

[1]  Rüdiger Lincke,et al.  Comparing software metrics tools , 2008, ISSTA '08.

[2]  D HerbslebJames,et al.  Two case studies of open source software development , 2002 .

[3]  Elena García Barriocanal,et al.  Empirical findings on team size and productivity in software development , 2012, J. Syst. Softw..

[4]  Jesús M. González-Barahona,et al.  On the reproducibility of empirical software engineering studies based on data retrieved from development repositories , 2011, Empirical Software Engineering.

[5]  T. Wright,et al.  Organizational Benchmarking Using the ISBSG Data Repository , 2001, IEEE Softw..

[6]  John A. Clark,et al.  Metrics are fitness functions too , 2004, 10th International Symposium on Software Metrics, 2004. Proceedings..

[7]  Martin J. Shepperd,et al.  Comparing Software Prediction Techniques Using Simulation , 2001, IEEE Trans. Software Eng..

[8]  Günther Ruhe,et al.  Search Based Software Engineering , 2013, Lecture Notes in Computer Science.

[9]  Barbara A. Kitchenham,et al.  The role of replications in empirical software engineering—a word of warning , 2008, Empirical Software Engineering.

[10]  Jesús S. Aguilar-Ruiz,et al.  Searching for rules to detect defective modules: A subgroup discovery approach , 2012, Inf. Sci..

[11]  Kevin Crowston,et al.  FLOSSmole: A Collaborative Repository for FLOSS Research Data and Analyses , 2006, Int. J. Inf. Technol. Web Eng..

[12]  Gregg Rothermel,et al.  Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact , 2005, Empirical Software Engineering.

[13]  Andreas Zeller,et al.  Change Bursts as Defect Predictors , 2010, 2010 IEEE 21st International Symposium on Software Reliability Engineering.

[14]  Michele Lanza,et al.  An extensive comparison of bug prediction approaches , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[15]  Ian Witten,et al.  Data Mining , 2000 .

[16]  Daniel Izquierdo-Cortazar,et al.  FLOSSMetrics: Free/Libre/Open Source Software Metrics , 2009, 2009 13th European Conference on Software Maintenance and Reengineering.

[17]  Thomas Zimmermann,et al.  Information needs for software development analytics , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[18]  Martin Shepperd,et al.  Case and Feature Subset Selection in Case-Based Software Project Effort Prediction , 2003 .

[19]  Hongyu Zhang On the Distribution of Software Faults , 2008, IEEE Transactions on Software Engineering.

[20]  Jing Li,et al.  The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies , 2010, 2010 Asia Pacific Software Engineering Conference.

[21]  Michele Marchesi,et al.  On the Distribution of Bugs in the Eclipse System , 2011, IEEE Transactions on Software Engineering.

[22]  Sushil Krishna Bajracharya,et al.  Sourcerer: mining and searching internet-scale software repositories , 2008, Data Mining and Knowledge Discovery.

[23]  Barry W. Boehm,et al.  Finding the right data for software cost modeling , 2005, IEEE Software.

[24]  Burak Turhan,et al.  On the dataset shift problem in software engineering prediction models , 2011, Empirical Software Engineering.

[25]  Anil K. Jain,et al.  Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  Gregory Piatetsky-Shapiro,et al.  The KDD process for extracting useful knowledge from volumes of data , 1996, CACM.

[27]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[28]  J. Herbsleb,et al.  Two case studies of open source software development: Apache and Mozilla , 2002, TSEM.

[29]  Michele Lanza,et al.  Evaluating defect prediction approaches: a benchmark and an extensive comparison , 2011, Empirical Software Engineering.

[30]  A. Zeller,et al.  Predicting Defects for Eclipse , 2007, Third International Workshop on Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007).

[31]  Stephen G. MacDonell,et al.  Evaluating prediction systems in software project estimation , 2012, Inf. Softw. Technol..

[32]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[33]  Jesús M. González-Barahona,et al.  Tools for the Study of the Usual Data Sources found in Libre Software Projects , 2009, Int. J. Open Source Softw. Process..

[34]  Rajesh Vasa,et al.  Growth and Change Dynamics in Open Source Software Systems , 2010 .

[35]  Francisco Herrera,et al.  Addressing the Classification with Imbalanced Data: Open Problems and New Challenges on Class Distribution , 2011, HAIS.