Assessing and mitigating privacy risk of sparse, noisy genotypes by local alignment to haplotype databases

Single nucleotide polymorphisms (SNPs) from omics data carry a high risk of reidentification for individuals and their relatives. While the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly demonstrated, the ready availability of small sets of noisy genotypes (such as those from environmental DNA samples or functional genomics data) motivated us to quantify their informativeness. Here, we present a computational tool suite, PLIGHT (“Privacy Leakage by Inference across Genotypic HMM Trajectories”), that employs population-genetics-based Hidden Markov Models of recombination and mutation to find piecewise alignments of small, noisy query SNP sets to a reference haplotype database. We explore cases where the query individual is known to be in the database and cases where they are not, and we consider a variety of queries, including simulated genotype “mosaics” (composites of two source individuals) and genotypes from swabs of coffee cups used by a known individual. Using PLIGHT on a database with ~5,000 haplotypes, we find that, for common, noise-free SNPs, ten are sufficient to identify individuals, ~20 can identify both components of two-individual simulated mosaics, and 20-30 can identify first-order relatives (parents, children, and siblings). Using noisy coffee-cup-derived SNPs, PLIGHT identifies an individual within the database using ~30 SNPs. Moreover, even when the individual is not in the database, local genotype matches allow some phenotypic information to leak through coarse-grained GWAS SNP imputation and polygenic risk scores. Overall, PLIGHT maximizes the identifying information content of sparse SNP sets through exact or partial matches to databases. Finally, by quantifying such privacy attacks, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about underlying population membership or allele frequencies.
To make this practical, we provide a sanitization tool to remove the most identifying SNPs from a query set.
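
The piecewise alignment described above can be illustrated with a minimal Viterbi sketch over a copying model in the style of Li and Stephens. This is not PLIGHT's implementation: it assumes a haploid query (PLIGHT works with genotypes), a constant switch probability between adjacent observed SNPs standing in for recombination, and a single mismatch probability standing in for mutation and genotyping noise. The function name `viterbi_copy_path` and the parameters `switch_prob` and `error_prob` are hypothetical, chosen only for this example.

```python
import math


def viterbi_copy_path(query, panel, switch_prob=0.05, error_prob=0.01):
    """Most likely sequence of reference haplotypes copied by a haploid query.

    query: list of 0/1 alleles at the observed SNPs.
    panel: list of reference haplotypes, each a list of 0/1 alleles
           at the same SNPs, in the same order.
    switch_prob and error_prob are illustrative values (assumptions).
    """
    n_hap = len(panel)
    # Transition model: with probability switch_prob the copied haplotype is
    # redrawn uniformly from the panel, otherwise it stays the same.
    log_stay = math.log(1.0 - switch_prob + switch_prob / n_hap)
    log_jump = math.log(switch_prob / n_hap)
    log_match = math.log(1.0 - error_prob)
    log_mismatch = math.log(error_prob)

    def log_emit(h, s):
        return log_match if panel[h][s] == query[s] else log_mismatch

    # Uniform prior over which haplotype is copied at the first SNP.
    score = [-math.log(n_hap) + log_emit(h, 0) for h in range(n_hap)]
    back = []
    for s in range(1, len(query)):
        # Every jump costs the same, so the best jump into any state comes
        # from the single best-scoring previous state.
        best_prev = max(range(n_hap), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(n_hap):
            stay = score[h] + log_stay
            jump = score[best_prev] + log_jump
            if stay >= jump:
                new_score.append(stay + log_emit(h, s))
                pointers.append(h)
            else:
                new_score.append(jump + log_emit(h, s))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy panel of three haplotypes over eight SNPs (made-up data).
panel = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
# A "mosaic" query: first half copied from haplotype 0, second half from 1.
mosaic = [0, 0, 0, 0, 1, 1, 1, 1]
print(viterbi_copy_path(mosaic, panel))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy case a single switch between haplotypes 0 and 1 is cheaper than four mismatches, so the recovered path splits the query into two segments, one per source haplotype. A query that matches one panel haplotype at every site yields a constant path instead.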
