Sharing confidential data for algorithm development by multiple imputation

The availability of real-life data sets is crucial for algorithm and application development, which often requires insight into the specific properties of the data. Such data are frequently not released, however, because they are proprietary and confidential. We propose to solve this problem with the statistical technique of multiple imputation, which we use to generate realistic synthetic data sets. We also show how the generated records can be combined into networked data using clustering techniques.
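To make the idea of generating synthetic data by multiple imputation concrete, the following is a minimal sketch and not the procedure used in this work. It assumes purely numeric data, a sequential linear-regression model for each column given the preceding ones, and a bootstrap draw for the first column; the function name `synthesize` and its parameters are hypothetical. Each of the `m` synthetic copies uses a fresh draw of the model parameters from their approximate posterior, which is what distinguishes multiple imputation from simply adding noise to fitted values.

```python
import numpy as np


def synthesize(data, m=5, seed=0):
    """Return m fully synthetic copies of a numeric 2-D array.

    Illustrative assumptions (not taken from the source):
    - column 0 is resampled with replacement (bootstrap);
    - each later column is modelled by a normal linear regression
      on all preceding columns;
    - parameters are drawn from the standard noninformative-prior
      posterior before each copy is generated.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    copies = []
    for _ in range(m):
        synth = np.empty_like(data)
        synth[:, 0] = rng.choice(data[:, 0], size=n, replace=True)
        for j in range(1, p):
            # Fit the regression on the confidential data.
            X = np.column_stack([np.ones(n), data[:, :j]])
            y = data[:, j]
            beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta_hat
            dof = n - X.shape[1]
            sigma2_hat = resid @ resid / dof

            # Draw the residual variance and coefficients from their posterior.
            sigma2 = sigma2_hat * dof / rng.chisquare(dof)
            cov = sigma2 * np.linalg.inv(X.T @ X)
            beta = rng.multivariate_normal(beta_hat, cov)

            # Generate the synthetic column from the already synthesized ones.
            Xs = np.column_stack([np.ones(n), synth[:, :j]])
            synth[:, j] = Xs @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
        copies.append(synth)
    return copies
```

Under these assumptions, only the `m` synthetic copies would be shared; analyses are run on each copy and the results combined, so the variation between copies reflects the uncertainty of the synthesis model.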
