Cross Project Defect Prediction via Balanced Distribution Adaptation Based Transfer Learning

Defect prediction helps allocate testing resources rationally by identifying potentially defective software modules before a product is released. When a project has no historical labeled defect data, cross-project defect prediction (CPDP) offers an alternative: it uses the labeled defect data of an external project to build a classification model that predicts the module labels of the current project. Transfer learning based CPDP methods are the current mainstream. In general, such methods aim to minimize the distribution difference between the data of the two projects. However, previous methods mainly focus on the marginal distribution difference and ignore the conditional distribution difference, which can lead to unsatisfactory performance. In this work, we use a novel balanced distribution adaptation (BDA) based transfer learning method to narrow this gap. BDA considers both kinds of distribution difference simultaneously and adaptively assigns different weights to them. To evaluate the effectiveness of BDA for CPDP, we conduct experiments on 18 projects from four datasets using six indicators (F-measure, g-means, Balance, AUC, EARecall, and EAF-measure). Compared with 12 baseline methods, BDA achieves average improvements of 23.8%, 12.5%, 11.5%, 4.7%, 34.2%, and 33.7% on the six indicators, respectively, across the four datasets.
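
The weighting idea behind BDA can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear-kernel stand-in for the distribution distance (the squared gap between feature means), a balance factor `mu` in [0, 1], and target pseudo-labels supplied by some classifier trained on the source project. The names `mean_gap`, `balanced_distance`, and `yt_pseudo` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_gap(a, b):
    """Squared Euclidean distance between the feature means of two samples.

    Assumption: this linear-kernel quantity stands in for the distribution
    distance; it is a simplification made for this sketch.
    """
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))


def balanced_distance(Xs, ys, Xt, yt_pseudo, mu):
    """Weighted sum of marginal and class-conditional distribution gaps.

    Xs, ys     : source-project features and defect labels
    Xt         : target-project features (unlabeled in CPDP)
    yt_pseudo  : assumed pseudo-labels for the target modules
    mu         : balance factor; 0 uses only the marginal gap,
                 1 uses only the conditional gap
    """
    marginal = mean_gap(Xs, Xt)
    conditional = 0.0
    for c in np.unique(ys):
        t = Xt[yt_pseudo == c]
        if len(t) == 0:
            # No target module was pseudo-labeled with this class; skip it.
            continue
        conditional += mean_gap(Xs[ys == c], t)
    return (1.0 - mu) * marginal + mu * conditional


# Toy data: the target project is the source project shifted by +1 per feature.
Xs = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
ys = np.array([0, 0, 1, 1])
Xt = Xs + 1.0
yt_pseudo = np.array([0, 0, 1, 1])

for mu in (0.0, 0.5, 1.0):
    print(mu, balanced_distance(Xs, ys, Xt, yt_pseudo, mu))
```

Setting `mu` near 0 recovers a purely marginal adaptation objective, while values near 1 emphasize the per-class (conditional) gap. How `mu` is chosen adaptively is not described in the abstract and is left out of this sketch.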
