Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks

Abstract: The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project to test the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or “beacon”) is responsible for ensuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertain and in accordance with GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH judged the risk of re-identification from binary yes/no allele-presence query responses to be acceptable. However, recent work demonstrated that, under specific conditions (including a beacon with a relatively small sample size and an adversary who possesses an individual’s whole genome sequence), the individual’s membership in a beacon can be inferred through repeated queries for variants present in the individual’s genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets.
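To make the query model and the budgeting idea concrete, the following is a minimal sketch of a toy beacon, not the implementation evaluated in the paper. The class name `ToyBeacon`, the `(chrom, pos, allele)` tuple encoding, and the flat cost of one unit per "yes" answer are all assumptions made for illustration; the abstract does not specify how the budget is charged.

```python
from collections import defaultdict


class ToyBeacon:
    """Hypothetical allele-presence beacon with a per-user, per-genome budget."""

    def __init__(self, genomes, budget=3):
        # genomes: dict mapping individual id -> set of (chrom, pos, allele)
        self.genomes = genomes
        self.budget = budget
        # spent[user][individual]: "yes" answers this genome has contributed to
        self.spent = defaultdict(lambda: defaultdict(int))

    def query(self, user, chrom, pos, allele):
        """Return True if some genome with remaining budget carries the allele."""
        variant = (chrom, pos, allele)
        carriers = [i for i, g in self.genomes.items() if variant in g]
        # Assumption: a genome whose budget for this user is exhausted
        # no longer contributes to answers given to that user.
        active = [i for i in carriers if self.spent[user][i] < self.budget]
        for i in active:
            # Assumption: flat cost of 1 per "yes", charged even for repeats.
            self.spent[user][i] += 1
        return bool(active)


genomes = {
    "a": {("1", 100, "A"), ("1", 200, "T"), ("2", 5, "G")},
    "b": {("1", 100, "A")},
}
beacon = ToyBeacon(genomes, budget=2)
answers = [
    beacon.query("u", "1", 100, "A"),  # carried by a and b
    beacon.query("u", "1", 200, "T"),  # carried by a only
    beacon.query("u", "2", 5, "G"),    # a's budget for user u is now spent
]
```

In this sketch the third query returns a "no" even though individual `a` carries the allele, because `a` has already contributed to two "yes" answers for user `u`. A different user starts with a fresh budget, so this simplified version does not address colluding users.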
