Detecting the Presence of an Individual in Phenotypic Summary Data

As the quantity and detail of association studies between clinical phenotypes and genotypes grow, there is a push to make summary statistics widely available. Genome-wide summary statistics have been shown to be vulnerable to the inference of a targeted individual's presence. In this paper, we show that presence attacks are feasible with phenome-wide summary statistics as well. We use data from three healthcare organizations and an online resource that publishes summary statistics. We introduce a novel attack that achieves over 80% recall and precision within a population of 16,346, where 8,173 individuals are targets. However, the feasibility of the attack depends on the attacker's knowledge about 1) the targeted individual and 2) the reference dataset. Within a population of over 2 million, where 8,173 individuals are targets, our attack achieves only 31% recall and 17% precision. As a result, it is plausible that phenomic summary statistics can be shared with an acceptable level of privacy risk.
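To make the reported rates concrete, the sketch below converts a recall and precision pair into the approximate counts they imply for a given number of targets. It is back-of-envelope arithmetic on the rounded percentages quoted above, not figures reported by the paper; the function name `implied_counts` and its return keys are hypothetical.

```python
def implied_counts(n_targets, recall, precision):
    """Approximate confusion counts implied by a recall/precision pair.

    recall    = TP / n_targets
    precision = TP / (TP + FP)
    """
    tp = recall * n_targets        # targets correctly flagged as present
    flagged = tp / precision       # everyone the attack flags as present
    fp = flagged - tp              # non-targets wrongly flagged
    fn = n_targets - tp            # targets the attack misses
    return {
        "true_positives": round(tp),
        "false_positives": round(fp),
        "false_negatives": round(fn),
        "flagged": round(flagged),
    }


if __name__ == "__main__":
    # Large-population setting from the abstract: 8,173 targets,
    # 31% recall, 17% precision (rounded values, so counts are approximate).
    print(implied_counts(8173, recall=0.31, precision=0.17))
    # Small-population setting, using 80% as a lower bound for both rates.
    print(implied_counts(8173, recall=0.80, precision=0.80))
```

Under these assumptions, the large-population setting corresponds to roughly 2,500 targets correctly detected against roughly 12,400 non-targets wrongly flagged, which illustrates why the attack is much less useful to an adversary there.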
