Analysis of sensitive information leakage in functional genomics signal profiles through genomic deletions

Functional genomics experiments, such as RNA-seq, provide information about gene expression under different conditions, such as disease and normal, that is not specific to an individual. There is a strong desire to share these data. However, privacy concerns often preclude sharing of the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and gene expression levels are used instead. Projects such as GTEx and ENCODE share these summaries because they ostensibly do not leak much identifying information. Here, we test this assumption by quantifying how much information about genomic deletions leaks from signal profiles. We present information-theoretic measures of the degree to which these deletions can be genotyped. We then develop practical genotyping approaches and demonstrate how they can be used to identify an individual within a large cohort through linking attacks. Finally, we present an anonymization method that removes much of the leakage from signal profiles.

Functional genomics data from many studies are widely shared publicly for their value in biomedical and disease research. Here, the authors show that sensitive information can leak through functional genomics signal profiles, and develop an anonymization procedure for privacy protection.
