Antecedents of open source software defects: A data mining approach to model formulation, validation and testing

This paper develops, tests, and validates a model for the antecedents of open source software (OSS) defects, using Data Mining and Text Mining. The public archives of OSS projects are used to access historical data on over 5,000 active and mature OSS projects. Using domain knowledge and exploratory analysis, a wide range of variables is identified from the process, product, resource, and end-user characteristics of a project to ensure that the model is robust and considers all aspects of the system. Multiple Data Mining techniques are used to refine the model, and the data is enriched through Text Mining for knowledge discovery from qualitative information. The study demonstrates the suitability of Data Mining and Text Mining for model building. Results indicate that project type, end-user activity, process quality, team size, and project popularity have a significant impact on the defect density of operational OSS projects. Since many organizations, both for-profit and not-for-profit, are beginning to use open source software as an economic alternative to commercial software, these results can inform decisions about which software an organization can reasonably maintain.
