Genetic programming-based decision trees for software quality classification

The knowledge of the likely problematic areas of a software system is very useful for improving its overall quality. Based on such information, a more focused software testing and inspection plan can be devised. Decision trees are attractive for a software quality classification problem which predicts the quality of program modules in terms of risk-based classes. They provide a comprehensible classification model which can be directly interpreted by observing the tree-structure. A simultaneous optimization of the classification accuracy and the size of the decision tree is a difficult problem, and very few studies have addressed the issue. This paper presents an automated and simplified genetic programming (gp) based decision tree modeling technique for the software quality classification problem. Genetic programming is ideally suited for problems that require optimization of multiple criteria. The proposed technique is based on multi-objective optimization using strongly typed GP. In the context of an industrial high-assurance software system, two fitness functions are used for the optimization problem: one for minimizing the average weighted cost of misclassification, and one for controlling the size of the decision tree. The classification performances of the GP-based decision trees are compared with those based on standard GP, i.e., S-expression tree. It is shown that the GP-based decision tree technique yielded better classification models. As compared to other decision tree-based methods, such as C4.5, GP-based decision trees are more flexible and can allow optimization of performance objectives other than accuracy. Moreover, it provides a practical solution for building models in the presence of conflicting objectives, which is commonly observed in software development practice.

[1]  David J. Montana,et al.  Strongly Typed Genetic Programming , 1995, Evolutionary Computation.

[2]  David A. Landgrebe,et al.  A survey of decision tree classifier methodology , 1991, IEEE Trans. Syst. Man Cybern..

[3]  Norman F. Schneidewind,et al.  Investigation of logistic regression as a discriminant of software quality , 2001, Proceedings Seventh International Software Metrics Symposium.

[4]  Nikolay I. Nikolaev,et al.  Inductive Genetic Programming with Decision Trees , 1997, Intell. Data Anal..

[5]  David W. Aha,et al.  Simplifying decision trees: A survey , 1997, The Knowledge Engineering Review.

[6]  Nikolay I. Nikolaev,et al.  Inductive Genetic Programming with Decision Trees , 1998, Intell. Data Anal..

[7]  W. Stadler Preference Optimality and Applications of Pareto-Optimality , 1975 .

[8]  Witold Pedrycz,et al.  Software quality analysis with the use of computational intelligence , 2002, 2002 IEEE World Congress on Computational Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE'02. Proceedings (Cat. No.02CH37291).

[9]  William B. Langdon,et al.  Application of Genetic Programming to Induction of Linear Classification Trees , 2000, EuroGP.

[10]  Peter J. Angeline,et al.  Advances in genetic programming: volume 2 , 1996 .

[11]  Hitoshi Iba,et al.  Genetic programming using a minimum description length principle , 1994 .

[12]  Taghi M. Khoshgoftaar,et al.  Genetic programming model for software quality classification , 2001, Proceedings Sixth IEEE International Symposium on High Assurance Systems Engineering. Special Topic: Impact of Networking.

[13]  I. Kononenko,et al.  INDUCTION OF DECISION TREES USING RELIEFF , 1995 .

[14]  Sreerama K. Murthy,et al.  Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey , 1998, Data Mining and Knowledge Discovery.

[15]  Yoichi Muraoka,et al.  Building software quality classification trees: approach, experimentation, evaluation , 1997, Proceedings The Eighth International Symposium on Software Reliability Engineering.

[16]  Peter J. Angeline,et al.  Type Inheritance in Strongly Typed Genetic Programming , 1996 .

[17]  Ming Zhao,et al.  Application of multivariate analysis for software fault prediction , 1998, Software Quality Journal.

[18]  John R. Koza,et al.  Genetic Programming II , 1992 .

[19]  Cullen Schaffer,et al.  Deconstructing the Digit Recognition Problem , 1992, ML.

[20]  Taghi M. Khoshgoftaar,et al.  Balancing Misclassification Rates in Classification-Tree Models of Software Quality , 2004, Empirical Software Engineering.

[21]  Peter Nordin,et al.  Genetic programming - An Introduction: On the Automatic Evolution of Computer Programs and Its Applications , 1998 .