Privacy preserving defect prediction using generalization and entropy-based data reduction

The software engineering community produces data that can be analyzed to improve the quality of future software products, and data scientists can use data on software defects to build defect predictors. However, sharing such data raises privacy concerns, since sensitive software features are usually considered business assets that should be protected in accordance with the law. Early research on protecting the privacy of software data found that applying conventional data anonymization to mask sensitive attributes of software features degrades the quality of the shared data. In addition, data produced by such approaches is not immune to attacks such as inference and background knowledge attacks. This research proposes a new approach for sharing a protected release of software defect data that can still be used by data science algorithms. We created a generalization (clustering)-based approach to anonymize sensitive software attributes. Tomek link and AllNN data reduction approaches were used to discard noisy records that may reduce the usefulness of the shared data. The proposed approach treats the diversity of sensitive attributes as an important factor in avoiding inference and background knowledge attacks on the anonymized data; therefore, discarded data is removed from both defective and non-defective records. We conducted experiments on several benchmark software defect datasets, using both data quality and privacy measures to evaluate the proposed approach. Our findings show that the proposed approach outperforms existing well-known techniques on both accuracy and privacy measures.
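To make the pipeline described above more concrete, the following is a minimal sketch of its three ingredients: Tomek link removal, an entropy measure of class diversity, and generalization of attribute values to group centroids. It is not the authors' implementation. The function names, the toy data, and the sort-and-chunk grouping (a simple stand-in for the clustering step) are all assumptions made for illustration, and the AllNN reduction is not shown.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def nearest_neighbor(i, points):
    """Index of the record closest to points[i] (Euclidean), excluding i itself."""
    best, best_dist = None, float("inf")
    for j, other in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], other)
        if d < best_dist:
            best, best_dist = j, d
    return best


def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels):
    """Drop both members of each Tomek link, so both classes lose records."""
    drop = {k for pair in tomek_links(points, labels) for k in pair}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


def generalize(points, group_size):
    """Replace each record with the centroid of its group.

    Assumption: groups are formed by sorting the records and chunking them,
    which is only a placeholder for a real clustering algorithm.
    """
    order = sorted(range(len(points)), key=lambda i: points[i])
    dims = len(points[0])
    out = [None] * len(points)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        centroid = tuple(
            sum(points[i][d] for i in group) / len(group) for d in range(dims)
        )
        for i in group:
            out[i] = centroid
    return out


# Hypothetical toy data: two software metrics per module, label 1 = defective.
points = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0), (3.1, 3.0), (5.0, 5.0), (5.2, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

clean_points, clean_labels = remove_tomek_links(points, labels)
anonymized = generalize(clean_points, group_size=2)
print("Tomek links:", tomek_links(points, labels))
print("Label entropy after reduction:", shannon_entropy(clean_labels))
print("Generalized records:", anonymized)
```

Removing both members of each Tomek link, rather than only the majority-class record, mirrors the abstract's point that discarded data comes from both defective and non-defective records. The entropy value gives one way to check that class diversity is preserved after reduction.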
