The Utility Challenge of Privacy-Preserving Data-Sharing in Cross-Company Defect Prediction: An Empirical Study of the CLIFF&MORPH Algorithm

In practice, the owners of source projects may need to share data without disclosing sensitive information. Privacy-preserving data sharing has therefore become an important topic in cross-company defect prediction (CCDP). The challenge is to achieve a high level of privacy while preserving the utility of the shared, privatized data for CCDP. CLIFF&MORPH is a recently proposed state-of-the-art privacy-preserving data-sharing algorithm for CCDP, and the CLIFF&MORPH CCDP model has been reported to achieve promising defect prediction performance. However, we find that ManualDown, a simple unsupervised module-size model built on the target projects, achieves comparable or even better defect prediction performance. Because ManualDown requires no source project data to build the model, it is free of the privacy-preserving data-sharing challenges of CCDP. For practitioners, this means that applying privacy-preserving data-sharing algorithms to CCDP cannot be well justified unless the utility challenge is addressed. We analyze the implications of our findings and outline directions for future research. In particular, we strongly suggest that future studies use at least ManualDown as a baseline model for comparison, to help develop practical privacy-preserving data-sharing algorithms for CCDP.
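To make the baseline concrete, the following is a minimal sketch of a size-only, unsupervised predictor in the spirit of ManualDown. The abstract states only that ManualDown is a simple module size model built on the target project. The ranking by descending size, the 50% cutoff (`top_fraction`), the function name `manual_down`, and the sample module sizes are all assumptions made for illustration, not details taken from the text above.

```python
from typing import Dict


def manual_down(module_sizes: Dict[str, int], top_fraction: float = 0.5) -> Dict[str, bool]:
    """Flag the largest modules of a target project as defect-prone.

    Uses module size only: no labels and no source-project data.
    top_fraction is an assumed cutoff, not a value given in the abstract.
    """
    if not 0.0 <= top_fraction <= 1.0:
        raise ValueError("top_fraction must be between 0 and 1")

    # Rank modules from largest to smallest; break ties by name so the
    # result is deterministic.
    ranked = sorted(module_sizes, key=lambda name: (-module_sizes[name], name))

    # The top share of the ranking is predicted defective.
    cutoff = int(len(ranked) * top_fraction)
    flagged = set(ranked[:cutoff])

    return {name: name in flagged for name in module_sizes}


# Hypothetical target-project modules with sizes in lines of code.
sizes = {"parser.c": 1200, "util.c": 90, "net.c": 640, "log.c": 150}
predictions = manual_down(sizes)
```

Because the function reads only the target project's own module sizes, no data has to leave a source company, which is why such a baseline sidesteps the privacy question altogether.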
