DPWeka: Achieving Differential Privacy in WEKA

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Organizations in the government, commercial, and non-profit sectors collect and store large amounts of sensitive data, including medical, financial, and personal information. They use data mining methods to formulate business strategies that yield short-term and long-term financial benefits. While analyzing such data, the private information of the individuals represented in the data must be protected for both moral and legal reasons. Current practices, such as redacting sensitive attributes, releasing only aggregate values, and query auditing, do not provide sufficient protection against an adversary armed with auxiliary information. Differential privacy, a privacy protection framework, provides mathematical guarantees against adversarial attacks even when the adversary holds additional background information. Existing platforms for differential privacy employ specific mechanisms for limited data mining applications, and widely used data mining tools do not include differentially private algorithms. As a result, awareness of differentially private methods for analyzing sensitive data remains limited outside the research community. This thesis examines various mechanisms for realizing differential privacy in practice and investigates methods for integrating them with WEKA, a popular machine learning toolkit. We present DPWeka, a package that adds differential privacy capabilities to WEKA for practical data mining. DPWeka includes a suite of differentially private algorithms that support a variety of data mining tasks, including attribute selection and regression analysis. It lets users control privacy and model parameters, such as the privacy mechanism, the privacy budget, and other algorithm-specific variables. We evaluate the private algorithms on real-world datasets, such as genetic and census data, to demonstrate the practical applicability of DPWeka.
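For reference, the guarantee invoked above can be stated precisely. A randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ differing in a single record and every set $S$ of possible outputs:

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The Laplace mechanism is the canonical way to achieve this for numeric queries: it perturbs the true answer with noise drawn from a Laplace distribution whose scale is the query's sensitivity divided by the privacy budget $\varepsilon$. The Java sketch below is purely illustrative; the class and method names are hypothetical and do not reflect DPWeka's actual implementation.

```java
import java.util.Random;

/**
 * Minimal sketch of the Laplace mechanism for epsilon-differential privacy.
 * Illustrative only; not DPWeka's actual API.
 */
public class LaplaceMechanism {
    private final Random rng = new Random();

    /**
     * @param trueValue   exact query answer computed on the private data
     * @param sensitivity maximum change in the query when one record changes
     * @param epsilon     privacy budget; smaller values give stronger privacy
     * @return noisy, differentially private answer
     */
    public double release(double trueValue, double sensitivity, double epsilon) {
        double scale = sensitivity / epsilon;
        // Inverse-CDF sampling from the zero-mean Laplace distribution:
        // draw u uniformly from (-0.5, 0.5), then noise = -b * sgn(u) * ln(1 - 2|u|).
        double u = rng.nextDouble() - 0.5;
        double noise = -scale * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
        return trueValue + noise;
    }
}
```

Smaller values of $\varepsilon$ yield stronger privacy but noisier answers; DPWeka exposes this trade-off to the user as the privacy budget parameter.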
