Inference attacks against differentially private query results from genomic datasets including dependent tuples

Abstract

Motivation: The rapid decrease in sequencing costs is driving a revolution in medical research and clinical care. Today, researchers have access to large genomic datasets for studying associations between variants and complex traits. However, the availability of such datasets also raises new privacy concerns about the personal information of study participants. Differential privacy (DP) is a rigorous privacy notion that has received widespread interest for sharing summary statistics from genomic datasets while protecting the privacy of participants against inference attacks. However, DP has a known drawback: it does not account for correlations between dataset tuples. The privacy guarantees of DP-based mechanisms may therefore degrade when the dataset includes dependent tuples, a common situation for genomic datasets due to the inherent correlations between the genomes of family members.

Results: In this article, using two real-life genomic datasets, we show that exploiting the correlation between dataset participants leads to significant information leakage from differentially private results of complex queries. We formulate this as an attribute inference attack and quantify the privacy loss in minor allele frequency (MAF) and chi-square queries. Our results show that, using the results of differentially private MAF queries and exploiting the dependency between tuples, an adversary can reveal up to 50% more sensitive information about the genome of a target (compared with the original privacy guarantees of standard DP-based mechanisms), while differentially private chi-square queries can reveal up to 40% more. Furthermore, we show that the adversary can use the genomic data inferred through the attribute inference attack to infer the membership of a target in another genomic dataset (e.g. one associated with a sensitive trait). Using a log-likelihood-ratio test, we also show that the adversary's inference power can be significantly high in such an attack even when using inferred (and hence partially incorrect) genomes.

Availability and implementation: https://github.com/nourmadhoun/Inference-Attacks-Differential-Privacy
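As background for the MAF queries discussed above, the standard DP release mechanism is Laplace noise calibrated to the query's sensitivity: one participant contributes at most 2 of the 2N alleles at a site, so changing one tuple shifts the MAF by at most 1/N. The sketch below is an illustrative implementation under those assumptions, not the paper's released code; the function name and the clamping to [0, 0.5] are our own choices.

```python
import numpy as np

def dp_maf(genotypes, epsilon, rng=None):
    """Differentially private minor allele frequency via the Laplace mechanism.

    genotypes: per-participant minor-allele counts (0, 1 or 2).
    Sensitivity: one participant holds at most 2 of the 2N alleles,
    so one tuple changes the MAF by at most 1/N.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(genotypes, dtype=float)
    n = len(g)
    maf = g.sum() / (2 * n)                       # true minor allele frequency
    scale = (1.0 / n) / epsilon                   # Laplace scale = sensitivity / epsilon
    noisy = maf + rng.laplace(0.0, scale)
    return float(min(max(noisy, 0.0), 0.5))      # a MAF lies in [0, 0.5] by definition
```

Note that the correlation issue the paper studies is orthogonal to this mechanism: the 1/N sensitivity bound assumes tuples are independent, which is exactly what fails when relatives appear in the dataset.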

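The membership-inference step uses a log-likelihood-ratio statistic comparing a target genome against the case pool's allele frequencies and a reference population, in the spirit of Sankararaman et al. [4]. The following is a minimal sketch of such a per-allele statistic; the function name and the frequency clipping are our own illustrative choices, not the authors' exact test.

```python
import numpy as np

def llr_membership(target, pool_freq, ref_freq, eps=1e-6):
    """Log-likelihood-ratio membership statistic for a target genome.

    target: per-SNP minor-allele counts (0, 1 or 2), possibly an *inferred*
    genome from the attribute inference attack.
    pool_freq / ref_freq: per-SNP minor allele frequencies in the case pool
    and in a reference population. Positive values favour membership.
    """
    t = np.asarray(target, dtype=float)
    p = np.clip(np.asarray(pool_freq, dtype=float), eps, 1 - eps)
    q = np.clip(np.asarray(ref_freq, dtype=float), eps, 1 - eps)
    # Each genotype contributes t copies of the minor allele and 2 - t of the major one.
    return float(np.sum(t * np.log(p / q) + (2 - t) * np.log((1 - p) / (1 - q))))
```

Comparing this statistic to a threshold gives the hypothesis test; the paper's point is that the statistic remains discriminative even when `target` is only a partially correct reconstruction.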