Heterogeneous Defect Prediction

Many recent studies have documented the success of cross-project defect prediction (CPDP) to predict defects for new projects lacking in defect data by using prediction models built by other projects. However, most studies share the same limitations: it requires homogeneous data; i.e., different projects must describe themselves using the same metrics. This paper presents methods for heterogeneous defect prediction (HDP) that matches up different metrics in different projects. Metric matching for HDP requires a “large enough” sample of distributions in the source and target projects—which raises the question on how large is “large enough” for effective heterogeneous defect prediction. This paper shows that empirically and theoretically, “large enough” may be very small indeed. For example, using a mathematical model of defect prediction, we identify categories of data sets were as few as 50 instances are enough to build a defect prediction model. Our conclusion for this work is that, even when projects use different metric sets, it is possible to quickly transfer lessons learned about defect prediction.