Clustering and dimensionality reduction to determine important software quality metrics

During the last two decades, research on software engineering has concentrated on quality. The best approach to quality evaluation is to define well-defined metrics on software properties. One such property is module complexity, a view of the software that is related to how easily it can be modified. There has been work on constructing a metrics domain that measures module complexity. Generally, PCA (Principal Component Analysis) is used to define the principal metrics in the domain. Since there are usually no labels for the software data, an unsupervised dimensionality reduction technique, such as PCA, needs to be used to determine the most important metrics. In this study, we compare the clustering obtained from a certain subset of metrics with the clustering obtained from the whole set of metrics, and use their similarity to determine the most important metrics. We measure the relative difference/similarity between clusterings using three different indices, namely Rand, Jaccard and Fowlkes-Mallows. We use both backward feature selection and PCA for dimensionality reduction. On the publicly available NASA data, we find that using only 15 dimensions instead of the whole set of 42 metrics gives almost the same clustering performance. Therefore, a smaller number of software metrics could be used instead of the whole set to evaluate software quality.
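
As an illustration of how two clusterings of the same modules can be compared, the following is a minimal sketch of the three indices in their standard pair-counting form. It is not the authors' implementation; the function names and the toy label lists are hypothetical, and the abstract does not state which clustering algorithm or index variants were actually used.

```python
import math
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of points by whether each clustering groups it together.

    Returns (n11, n10, n01, n00): pairs together in both clusterings, together
    only in A, together only in B, and separated in both.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same points")
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    n11, n10, n01, n00 = pair_counts(labels_a, labels_b)
    total = n11 + n10 + n01 + n00
    return (n11 + n00) / total if total else 1.0


def jaccard_index(labels_a, labels_b):
    """Like Rand, but ignores pairs that are separated in both clusterings."""
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0


def fowlkes_mallows_index(labels_a, labels_b):
    """Geometric mean of pairwise precision and recall."""
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    denom = math.sqrt((n11 + n10) * (n11 + n01))
    return n11 / denom if denom else 0.0


# Hypothetical labels: clustering with all metrics vs. with a reduced subset.
full_set = [0, 0, 1, 1]
subset = [0, 0, 1, 2]
print(rand_index(full_set, subset))
print(jaccard_index(full_set, subset))
print(fowlkes_mallows_index(full_set, subset))
```

All three indices depend only on which points share a cluster, not on the label values, so renaming clusters leaves the scores unchanged. A score near 1 for a reduced metric set would indicate that it reproduces the grouping obtained from the full set.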
