Applying the Mahalanobis-Taguchi strategy for software defect diagnosis

The Mahalanobis-Taguchi (MT) strategy combines mathematical and statistical concepts such as the Mahalanobis distance, Gram-Schmidt orthogonalization, and experimental designs to support diagnosis and decision-making based on multivariate data. Its primary purpose is to develop a scale that measures the degree of abnormality of cases relative to “normal” or “healthy” cases, i.e., to derive a continuous scale from a set of binary-classified cases. An optimal subset of variables for measuring abnormality is then selected, and rules for future diagnosis are defined based on these variables and the measurement scale. This maps well to software defect prediction based on a multivariate set of software metrics and attributes. In this paper, the MT strategy, combined with a cluster analysis technique for determining the most appropriate training set, is described and applied to well-known datasets in order to evaluate the fault-proneness of software modules. The measurement scale resulting from the MT strategy is evaluated using ROC curves, and the results indicate that it is a promising technique for software defect diagnosis. It compares favorably to previously evaluated methods on a number of publicly available datasets. Because the MT strategy quantifies the level of abnormality, it can also stimulate and inform discussions with engineers and managers in different defect prediction situations.
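To make the idea of a continuous abnormality scale concrete, the following is a minimal Python sketch, not the implementation used in the paper. It builds a scaled Mahalanobis distance from a "normal" reference group and summarizes how well that scale separates the two groups with the area under the ROC curve. The synthetic data, the function names (`fit_mahalanobis_space`, `scaled_md`, `auc`), and the four hypothetical metrics are assumptions made for illustration only; the Gram-Schmidt variant, the orthogonal-array variable selection, and the clustering step for choosing the training set are not shown.

```python
import numpy as np


def fit_mahalanobis_space(normal):
    """Estimate the reference space from the 'normal' group only (illustrative helper)."""
    mean = normal.mean(axis=0)
    std = normal.std(axis=0, ddof=1)
    z = (normal - mean) / std
    corr = np.corrcoef(z, rowvar=False)  # correlation matrix of the normal group
    return mean, std, np.linalg.inv(corr)


def scaled_md(x, mean, std, inv_corr):
    """Scaled Mahalanobis distance: z^T C^{-1} z divided by the number of variables."""
    z = (x - mean) / std
    k = x.shape[1]
    return np.einsum("ij,jk,ik->i", z, inv_corr, z) / k


def auc(scores_abnormal, scores_normal):
    """Area under the ROC curve: the probability that a randomly chosen
    abnormal case scores higher than a randomly chosen normal case."""
    diff = scores_abnormal[:, None] - scores_normal[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Synthetic stand-in for software metrics (assumption: 4 metrics per module).
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 4))    # modules labelled non-faulty
abnormal = rng.normal(1.5, 1.5, size=(50, 4))   # modules labelled faulty

mean, std, inv_corr = fit_mahalanobis_space(normal)
md_normal = scaled_md(normal, mean, std, inv_corr)
md_abnormal = scaled_md(abnormal, mean, std, inv_corr)

print("mean scaled MD, normal group:  ", md_normal.mean())
print("mean scaled MD, abnormal group:", md_abnormal.mean())
print("AUC of the MD scale:           ", auc(md_abnormal, md_normal))
```

Dividing by the number of variables puts the average distance of the normal group close to one, so distances well above one can be read as degrees of abnormality rather than as a yes/no label.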
