A Quantitative Approach for Evaluating the Utility of a Differentially Private Behavioral Science Dataset

Social scientists who collect large amounts of medical data value the privacy of their survey participants. As they follow participants through longitudinal studies, they develop unique profiles of these individuals. A growing challenge for these researchers is to maintain the privacy of their study participants while sharing their data to facilitate research. Differential privacy is a new mechanism that promises improved privacy guarantees for statistical databases. We evaluate the utility of a differentially private dataset. Our results align with the theory of differential privacy and show that, when the number of records in the database is sufficiently larger than the number of cells covered by a database query, the number of statistical tests with results close to those performed on the original data increases.
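
The relationship between record count, cell count, and utility can be illustrated with a minimal sketch. It is not the evaluation procedure used in this work: it assumes a histogram query answered with the standard Laplace mechanism (sensitivity 1, privacy parameter epsilon) and measures utility as relative L1 error in the cell counts rather than agreement of statistical tests. The function names, the uniform assignment of records to cells, and the parameter values are all hypothetical choices made for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, rng):
    # Assumption: adding or removing one record changes one cell by 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return [c + laplace_noise(1.0 / epsilon, rng) for c in counts]


def mean_relative_error(n_records, n_cells, epsilon, rng, trials=50):
    # Hypothetical data: records are spread uniformly at random over cells.
    counts = [0] * n_cells
    for _ in range(n_records):
        counts[rng.randrange(n_cells)] += 1

    total = 0.0
    for _ in range(trials):
        noisy = noisy_histogram(counts, epsilon, rng)
        l1_error = sum(abs(a - b) for a, b in zip(counts, noisy))
        total += l1_error / n_records  # error relative to dataset size
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n_records in (100, 1000, 10000):
        err = mean_relative_error(n_records, n_cells=20, epsilon=1.0, rng=rng)
        print(n_records, round(err, 4))
```

Under these assumptions the expected relative error is roughly n_cells / (epsilon * n_records), so it shrinks as the number of records grows relative to the number of cells covered by the query, which is consistent with the trend described above.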
