Reducing overfitting in genetic programming models for software quality classification

A high-assurance system is largely dependent on the quality of its underlying software. Software quality models can provide timely estimations of software quality, allowing the detection and correction of faults prior to operations. A software metrics-based quality prediction model may exhibit overfitting, which occurs when a prediction model has good accuracy on the training data but relatively poor accuracy on the test data. We present an approach to address the overfitting problem in the context of software quality classification models based on genetic programming (GP). The problem has not been addressed in depth for GP-based models. The presence of overfitting in a software quality classification model limits its practical usefulness, because management is interested in how well the model performs when applied to unseen software modules, i.e., its generalization performance. In the process of building GP-based software quality classification models for a high-assurance telecommunications system, we observed that the GP models were prone to overfitting. We utilize a random sampling technique to reduce overfitting in our GP models. Many researchers have found random sampling to be an effective method for reducing the time of a GP run. In our study, however, we utilize random sampling to reduce overfitting, with the aim of improving the generalization capability of our GP models.
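
As a rough illustration of the random sampling idea, the following minimal Python sketch evaluates a candidate classifier on a freshly drawn random subset of the training modules in each generation, rather than on the full training set. The abstract does not specify the sampling scheme, so the subset fraction, the per-generation resampling, the toy data, and the names `sample_indices`, `subset_error`, and `candidate` are all assumptions made for illustration; the threshold rule only stands in for an evolved GP individual.

```python
import random

def sample_indices(n_modules, fraction, rng):
    """Draw a random subset of training-module indices without replacement."""
    k = max(1, int(round(n_modules * fraction)))
    return rng.sample(range(n_modules), k)

def subset_error(classify, metrics, labels, indices):
    """Misclassification rate of `classify` on the sampled modules only."""
    wrong = sum(1 for i in indices if classify(metrics[i]) != labels[i])
    return wrong / len(indices)

# Toy data (hypothetical): one metric per module, label 1 = fault-prone.
metrics = [[5], [12], [40], [55], [8], [70], [33], [90]]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

# Stand-in for one evolved GP individual: a simple threshold rule.
def candidate(m):
    return 1 if m[0] > 35 else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
for generation in range(3):
    # Assumption: a fresh subset is drawn every generation.
    idx = sample_indices(len(metrics), 0.5, rng)
    print(generation, sorted(idx), subset_error(candidate, metrics, labels, idx))
```

Because each generation sees a different subset, an individual that merely memorizes particular training modules is less likely to keep a high fitness across generations, which is the intuition behind using sampling to improve generalization.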
