Privacy-preserving data publishing: A survey of recent developments

The collection of digital information by governments, corporations, and individuals has created tremendous opportunities for knowledge- and information-based decision making. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Data in its original form, however, typically contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing relies mainly on policies and guidelines as to what types of data can be published and on agreements on the use of published data. This approach alone may lead to excessive data distortion or insufficient protection. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing useful information while preserving data privacy. Recently, PPDP has received considerable attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this survey, we will systematically summarize and evaluate different approaches to PPDP, study the challenges in practical data publishing, clarify the differences and requirements that distinguish PPDP from other related problems, and propose future research directions.
