Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

Sharing genomic data in a privacy-preserving way remains a major bottleneck for the scientific progress promised by the big data era in genomics. A community-driven protocol, the genomic data-sharing beacon protocol, has been widely adopted for sharing genomic data. The system aims to provide a secure, easy-to-implement, and standardized interface for data sharing by allowing only yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that an attacker who knows that the victim was added to the beacon in a recent update can reconstruct a substantial part of the victim's genome. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible to the attacker (e.g., eye and hair color). Moreover, we show how a genome reconstructed from a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks against beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon. Thus, this work is an important step toward helping beacon operators and participants make informed decisions.
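The core intuition behind reconstruction after an update can be illustrated with a toy sketch. It is not the method of the paper: it assumes a single individual is added, that the attacker can query both the pre-update and post-update beacon, and that a simple set difference is enough. The names `make_beacon`, `newly_present_alleles`, and the `(chromosome, position, allele)` query format are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical query format: (chromosome, position, alternate allele).
Query = Tuple[str, int, str]
Beacon = Callable[[Query], bool]


def make_beacon(genomes: Iterable[Set[Query]]) -> Beacon:
    """Toy beacon: answers True iff at least one stored genome carries the allele."""
    present: Set[Query] = set().union(*genomes)
    return lambda q: q in present


def newly_present_alleles(before: Beacon, after: Beacon,
                          queries: Iterable[Query]) -> Set[Query]:
    """Alleles answered 'no' before the update and 'yes' after it.

    Under the single-addition assumption, these must belong to the new member.
    """
    return {q for q in queries if after(q) and not before(q)}


# Illustrative data only (not real variants).
existing = [{("1", 100, "A")}, {("2", 200, "T")}]
victim = {("1", 100, "A"), ("3", 300, "G"), ("4", 400, "C")}

before = make_beacon(existing)
after = make_beacon(existing + [victim])

queries = [("1", 100, "A"), ("2", 200, "T"), ("3", 300, "G"),
           ("4", 400, "C"), ("5", 500, "G")]
recovered = newly_present_alleles(before, after, queries)
```

In this toy setting the allele the victim shares with an existing member, `("1", 100, "A")`, is not recovered, because the beacon already answered "yes" for it before the update. This is one reason a response difference alone yields only part of a genome.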
