A General Survey of Privacy-Preserving Data Mining Models and Algorithms

In recent years, privacy-preserving data mining has been studied extensively because of the wide proliferation of sensitive information on the internet. A number of algorithmic techniques have been designed for privacy-preserving data mining. In this paper, we review the state-of-the-art methods in this area. We discuss methods for randomization, k-anonymization, and distributed privacy-preserving data mining. We also discuss cases in which the output of data mining applications needs to be sanitized for privacy-preservation purposes. Finally, we discuss the computational and theoretical limits associated with privacy preservation over high-dimensional data sets.
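
As a brief illustration of the k-anonymization idea mentioned above, the following minimal sketch checks whether a table is k-anonymous, using the standard definition that every combination of quasi-identifier values must be shared by at least k records. The function name `is_k_anonymous`, the attribute names, and the sample records are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in >= k records.

    records: list of dicts (hypothetical table rows).
    quasi_identifiers: attribute names treated as quasi-identifiers.
    """
    # Count how many records share each combination of quasi-identifier values.
    group_sizes = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    # The table is k-anonymous if no group is smaller than k.
    return all(size >= k for size in group_sizes.values())


# Illustrative, made-up records with already generalized age and zip values.
sample = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cold"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "asthma"},
]

two_anonymous = is_k_anonymous(sample, ["age", "zip"], 2)
three_anonymous = is_k_anonymous(sample, ["age", "zip"], 3)
```

Here each (age, zip) combination is shared by exactly two records, so the check succeeds for k = 2 and fails for k = 3. The sketch only verifies the property; it does not perform the generalization or suppression needed to achieve it.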
