Classification of software patches: a text mining approach

Installing maintenance patches in operational software systems consumes significant expenditure and resources. When deciding whether to roll out a patch to all systems, managers often have to balance the risk posed by publicly announced vulnerabilities against the possible destabilization of existing applications. We propose a classification scheme for maintenance patches and examine the effects of patch category on the internal characteristics of a software system. By text mining the patch releases of 77 successive versions of the Linux operating system, we extend previous categorization schemes to maintenance patches. This level of granularity offers a view of the aggregate nature of the tasks performed in each version. An unsupervised learning technique, cluster analysis combined with text mining, reveals three identifiable categories in Linux patch files. Based on the maintenance keywords in each category, we label them corrective, perfective and adaptive patches. Further analysis of the effects of patch category on structural complexity and the time to next release indicates that perfective patches are associated with a reduction in both structural complexity and the frequency of patch releases. Categorization at the patch level is useful for managers because changes to operational software systems are made through patches. Determining the nature of a patch can assist managers in planning version roll out and testing criteria. Copyright © 2010 John Wiley & Sons, Ltd.
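
The clustering and keyword-based labeling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the clustering algorithm or tooling, so a simple k-means over raw term counts with cosine similarity stands in for it. The patch notes in `PATCH_NOTES` and the keyword sets in `KEYWORDS` are hypothetical and were invented for the example.

```python
import math
from collections import Counter

# Hypothetical patch notes, invented for illustration (not Linux data).
PATCH_NOTES = [
    "fix crash bug in scheduler",
    "cleanup refactor scheduler code",
    "add support for new usb driver",
    "fix memory leak bug in driver",
    "refactor cleanup simplify memory code",
    "add new driver support for network",
]

# Hypothetical maintenance keywords per category (an assumption,
# not the keyword lists used in the paper).
KEYWORDS = {
    "corrective": {"fix", "bug", "crash", "leak"},
    "perfective": {"cleanup", "refactor", "simplify"},
    "adaptive": {"add", "support", "new"},
}


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iterations=10):
    """Cluster term-count vectors; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid as the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                totals = Counter()
                for m in members:
                    totals.update(m)
                centroids[c] = {t: n / len(members) for t, n in totals.items()}
    return assignment


def label_clusters(token_lists, assignment, k):
    """Label each cluster with the category whose keywords occur most often."""
    labels = {}
    for c in range(k):
        counts = Counter(
            t for toks, a in zip(token_lists, assignment) if a == c for t in toks
        )
        labels[c] = max(
            KEYWORDS, key=lambda name: sum(counts[t] for t in KEYWORDS[name])
        )
    return labels


tokens = [tokenize(note) for note in PATCH_NOTES]
vectors = [Counter(toks) for toks in tokens]
assignment = kmeans(vectors, 3)
labels = label_clusters(tokens, assignment, 3)

for note, cluster in zip(PATCH_NOTES, assignment):
    print(labels[cluster], "<-", note)
```

Seeding the centroids with the first three documents keeps the sketch deterministic; a real analysis would more likely weight terms (for example with TF-IDF), remove stop words, and use a more robust initialization.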
