PrivLogit: Efficient Privacy-preserving Logistic Regression by Tailoring Numerical Optimizers

Safeguarding privacy in machine learning is highly desirable, especially in collaborative studies that span many organizations. Privacy-preserving distributed machine learning based on cryptography is a popular way to address this need. However, existing cryptographic protocols still incur excessive computational overhead. Here, we observe that this overhead is partly due to the naive adoption of mainstream numerical optimization (e.g., Newton's method) without tailoring it to secure computation. This work takes the opposite perspective: customizing numerical optimization specifically for secure settings. We propose a seemingly less favorable optimization method that in fact significantly accelerates privacy-preserving logistic regression. Leveraging this method, we propose two new secure protocols for conducting logistic regression in a privacy-preserving and distributed manner. Extensive theoretical and empirical evaluations demonstrate the competitive performance of our two secure protocols without compromising accuracy or privacy: they achieve speedups of up to 2.3x and 8.1x, respectively, over the state of the art, and the speedups grow as data scales up. These speedups come in addition to any performance improvements from existing (and future) state-of-the-art cryptography. Our work provides a new path towards efficient and practical privacy-preserving logistic regression for the large-scale studies that are common in modern science.
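To make the trade-off concrete, the following plaintext sketch contrasts standard Newton iterations with a constant-Hessian variant. The abstract does not name the proposed optimizer, so the choice of the well-known bound X^T X / 4 on the logistic-regression Hessian is an assumption made here for illustration, as are all function names. No cryptography is involved; the sketch only shows why a method that needs more iterations can still be cheaper per iteration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, max_iter=100, tol=1e-8):
    """Standard Newton: the Hessian is rebuilt and solved at every iteration."""
    beta = np.zeros(X.shape[1])
    for k in range(max_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                 # gradient of the log-likelihood
        w = p * (1.0 - p)                    # per-sample weights, change each step
        hess = X.T @ (X * w[:, None])        # negated Hessian, recomputed each step
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            return beta, k + 1
    return beta, max_iter

def fixed_hessian_logistic(X, y, max_iter=10000, tol=1e-8):
    """Illustrative assumption: replace the Hessian by the constant bound
    X^T X / 4, so the matrix inverse is computed exactly once."""
    beta = np.zeros(X.shape[1])
    bound_inv = np.linalg.inv(0.25 * (X.T @ X))   # one-time cost
    for k in range(max_iter):
        p = sigmoid(X @ beta)
        step = bound_inv @ (X.T @ (y - p))   # only a gradient is needed per step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            return beta, k + 1
    return beta, max_iter
```

On non-separable data both routines converge to the same maximum-likelihood estimate; the constant-Hessian variant typically takes more iterations, but each one avoids rebuilding and inverting a matrix, which is usually the costly operation under secure computation.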
