Complexity-based classification of software modules

Software plays a major role in many organizations, and organizational success depends in part on the quality of the software used. In recent years, many researchers have recognized that statistical classification techniques are well suited to developing software quality prediction models. Various statistical software quality models that use complexity metrics as early indicators of software quality have been proposed. At a high level, the problem of software categorization is to classify software modules as fault-prone or non-fault-prone. The focus of this thesis is twofold. The first aim is to study selected classification techniques, including unsupervised and supervised learning algorithms, that are widely used for software categorization. The second is to explore a new unsupervised learning model that employs Bayesian and deterministic approaches. We also evaluate and compare these approaches experimentally on a real data set. Our experimental results show that the algorithms yield results that differ to a statistically significant degree.
