Software defect prediction using semi-supervised learning with dimension reduction

Accurate detection of fault prone modules offers the path to high quality software products while minimizing non essential assurance expenditures. This type of quality modeling requires the availability of software modules with known fault content developed in similar environment. Establishing whether a module contains a fault or not can be expensive. The basic idea behind semi-supervised learning is to learn from a small number of software modules with known fault content and supplement model training with modules for which the fault information is not available. In this study, we investigate the performance of semi-supervised learning for software fault prediction. A preprocessing strategy, multidimensional scaling, is embedded in the approach to reduce the dimensional complexity of software metrics. Our results show that the semi-supervised learning algorithm with dimension-reduction preforms significantly better than one of the best performing supervised learning algorithms, random forest, in situations when few modules with known fault content are available for training.