Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis

A general solution method for the automatic generation of decision (or classification) trees is investigated. The approach is to provide insight through in-depth empirical characterization and evaluation of decision trees for one problem domain, specifically that of software resource data analysis. The purpose of the decision trees is to identify classes of objects (software modules) that had high development effort, i.e., effort in the uppermost quartile relative to past data. Sixteen software systems ranging from 3,000 to 112,000 source lines were selected for analysis from a NASA production environment. The collection and analysis of 74 attributes (or metrics) for over 4,700 objects captured a multitude of information about the objects: development effort, faults, changes, design style, and implementation style. A total of 9,600 decision trees were automatically generated and evaluated. The analysis focuses on the characterization and evaluation of decision tree accuracy, complexity, and composition. The decision trees correctly identified 79.3% of the software modules that had high development effort or faults, on average across all 9,600 trees. The decision trees generated from the best parameter combinations correctly identified 88.4% of the modules on average. Visualization of the results is emphasized, and sample decision trees are included.
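To make the classification task concrete, the sketch below shows how a decision tree could be trained to flag modules whose development effort falls in the uppermost quartile of historical data. This is a minimal illustration, not the generation tool used in the study: the metrics matrix, effort values, and the use of scikit-learn's DecisionTreeClassifier are assumptions introduced here for demonstration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical metrics matrix: one row per software module, one column per
# attribute (e.g. source lines, changes, design and implementation measures).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 modules, 10 metrics (placeholder data)
effort = rng.gamma(shape=2.0, size=200)  # placeholder development-effort values

# Label a module "high effort" if its effort lies in the uppermost quartile
# of the historical data, mirroring the class definition in the abstract.
threshold = np.percentile(effort, 75)
y = (effort >= threshold).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree keeps the result small enough to inspect, in the spirit of
# the paper's interest in decision tree complexity and composition.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# Fraction of modules classified correctly, analogous to the accuracy
# figures reported above (e.g. 79.3% on average across all trees).
accuracy = tree.score(X_test, y_test)
print(f"correctly classified: {accuracy:.1%}")
```

The quartile threshold turns effort into a binary class before learning, so the tree predicts membership in the high-effort class rather than a numeric effort value; varying the depth limit and other parameters is one way to explore the accuracy-versus-complexity trade-off the study characterizes.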
