Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons

Abstract

Sharing genome data in a privacy-preserving way is a major bottleneck to the scientific progress promised by the big data era in genomics. A community-driven protocol, the genomic data-sharing beacon protocol, has been widely adopted for sharing genomic data. The system aims to provide a secure, easy-to-implement, and standardized interface for data sharing by allowing only yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this paper, we show that privacy threats against genomic data-sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to reconstruct a substantial part of a victim’s genome when the attacker knows that the victim was added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome, together with clustering techniques, to run such an attack efficiently and accurately. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim’s genome with high confidence using traits that are easily accessible to the attacker (e.g., eye color or hair type). Moreover, we show how a genome reconstructed from a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks against beacons associated with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and will help them, along with the beacon participants, make informed decisions.
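To make the update-based leakage concrete, the following is a minimal sketch of the basic observation behind the attack, not the method evaluated in the paper. It assumes a toy in-memory beacon; the names `ToyBeacon`, `flipped_alleles`, and `demo`, and the `(position, base)` allele encoding, are hypothetical. An attacker who records yes/no responses before and after an update learns every queried allele whose answer flips from "no" to "yes", since only a newly added individual can have introduced it.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding of an allele: (genomic position, alternate base).
Allele = Tuple[int, str]


class ToyBeacon:
    """Toy stand-in for a beacon: answers only yes/no allele-presence queries."""

    def __init__(self) -> None:
        self._genomes: Dict[str, Set[Allele]] = {}

    def add(self, person_id: str, alleles: Set[Allele]) -> None:
        """Add one individual's alleles (models a beacon update)."""
        self._genomes[person_id] = set(alleles)

    def query(self, allele: Allele) -> bool:
        """Return True if at least one individual carries the allele."""
        return any(allele in genome for genome in self._genomes.values())


def flipped_alleles(before: Dict[Allele, bool],
                    after: Dict[Allele, bool]) -> Set[Allele]:
    """Alleles whose response changed from 'no' to 'yes' across the update."""
    return {a for a, yes in after.items() if yes and not before[a]}


def demo() -> Set[Allele]:
    beacon = ToyBeacon()
    beacon.add("p1", {(101, "A"), (205, "T")})
    beacon.add("p2", {(101, "A"), (310, "G")})

    # Alleles the attacker chooses to probe (an assumption of this sketch).
    candidates = [(101, "A"), (205, "T"), (310, "G"), (412, "C"), (518, "G")]
    before = {a: beacon.query(a) for a in candidates}

    # The update: the victim is added to the beacon.
    beacon.add("victim", {(101, "A"), (412, "C"), (518, "G")})
    after = {a: beacon.query(a) for a in candidates}

    return flipped_alleles(before, after)


if __name__ == "__main__":
    print(sorted(demo()))
```

In this toy run the victim's allele `(101, "A")` is not recovered, because the beacon already answered "yes" for it before the update. Response differences alone therefore expose only the alleles that are new to the beacon, which is why the abstract describes additionally exploiting correlations in the genome and clustering.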
