PEM: A Practical Differentially Private System for Large-Scale Cross-Institutional Data Mining

Privacy has become a serious concern in data mining. Achieving adequate privacy is especially challenging when the scale of the problem is large. Fundamentally, designing a practical privacy-preserving data mining system involves tradeoffs among several factors such as the privacy guarantee, the accuracy or utility of the mining result, the computation efficiency and the generality of the approach. In this paper, we present PEM, a practical system that tries to strike the right balance among these factors. We use a combination of noise-based and noise-free techniques to achieve provable differential privacy at a low computational overhead while obtaining more accurate result than previous approaches. PEM provides an efficient private gradient descent that can be the basis for many practical data mining and machine learning algorithms, like logistic regression, k-means, and Apriori. We evaluate these algorithms on three real-world open datasets in a cloud computing environment. The results show that PEM achieves good accuracy, high scalability, low computation cost while maintaining differential privacy.

[1]  Suman Nath,et al.  Differentially private aggregation of distributed time-series with transformation and encryption , 2010, SIGMOD Conference.

[2]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[3]  M. H. Margahny,et al.  FAST ALGORITHM FOR MINING ASSOCIATION RULES , 2014 .

[4]  Dan Bogdanov,et al.  Sharemind: A Framework for Fast Privacy-Preserving Computations , 2008, ESORICS.

[5]  Bhiksha Raj,et al.  Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers , 2010, NIPS.

[6]  Elisa Bertino,et al.  Differentially Private K-Means Clustering , 2015, CODASPY.

[7]  Assaf Schuster,et al.  Privacy-Preserving Distributed Stream Monitoring , 2014, NDSS.

[8]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[9]  Raghav Bhaskar,et al.  Noiseless Database Privacy , 2011, ASIACRYPT.

[10]  Andrew Chi-Chih Yao,et al.  Protocols for secure computations , 1982, FOCS 1982.

[11]  Yuguang Fang,et al.  Privacy-Preserving Data Classification and Similarity Evaluation for Distributed Systems , 2016, 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS).

[12]  Hassan Takabi,et al.  Differentially Private Distributed Data Analysis , 2016, 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC).

[13]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[14]  Hari Balakrishnan,et al.  CryptDB: protecting confidentiality with encrypted query processing , 2011, SOSP.

[15]  Vitaly Shmatikov,et al.  Privacy-preserving deep learning , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[16]  Ning Zhang,et al.  Distributed Data Mining with Differential Privacy , 2011, 2011 IEEE International Conference on Communications (ICC).

[17]  Moni Naor,et al.  Pan-Private Streaming Algorithms , 2010, ICS.

[18]  Xenofontas A. Dimitropoulos,et al.  SEPIA: Privacy-Preserving Aggregation of Multi-Domain Network Events and Statistics , 2010, USENIX Security Symposium.

[19]  Adi Shamir,et al.  How to share a secret , 1979, CACM.

[20]  Benny Pinkas,et al.  FairplayMP: a system for secure multi-party computation , 2008, CCS.

[21]  Duan,et al.  Differential Privacy for Sum Queries without External Noise ∗ † Yitao , 2010 .

[22]  Yitao Duan Privacy without noise , 2009, CIKM.

[23]  Martin Zinkevich,et al.  Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[24]  Rathindra Sarathy,et al.  Evaluating Laplace Noise Addition to Satisfy Differential Privacy for Numeric Data , 2011, Trans. Data Priv..

[25]  Yitao Duan,et al.  P4P: Practical Large-Scale Privacy-Preserving Distributed Computation Robust against Malicious Users , 2010, USENIX Security Symposium.

[26]  Ninghui Li,et al.  On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy , 2011, ASIACCS '12.

[27]  Anand D. Sarwate,et al.  A near-optimal algorithm for differentially-private principal components , 2012, J. Mach. Learn. Res..

[28]  Yuguang Fang,et al.  Privacy-Preserving Machine Learning Algorithms for Big Data Systems , 2015, 2015 IEEE 35th International Conference on Distributed Computing Systems.

[29]  Roksana Boreli,et al.  Applying Differential Privacy to Matrix Factorization , 2015, RecSys.

[30]  Frank McSherry,et al.  Privacy integrated queries: an extensible platform for privacy-preserving data analysis , 2009, SIGMOD Conference.