Private information leakage from functional genomics data: Quantification with calibration experiments and reduction via data sanitization protocols

The generation of functional genomics datasets is surging, as they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intention of functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to share raw reads for better analyses and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, thus enabling principled privacy-utility trade-offs. It works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA-sequencing. The procedure depends on quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.

[1]  Yaniv Erlich,et al.  Identity inference of genomic data using long-range familial searches , 2018, Science.

[2]  Wenhui Wang,et al.  A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis , 2019, GigaScience.

[3]  Mauricio O. Carneiro,et al.  Scaling accurate genetic variant discovery to tens of thousands of samples , 2017, bioRxiv.

[4]  Peter J. Park,et al.  NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types , 2017, Nucleic acids research.

[5]  Mauricio O. Carneiro,et al.  From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.

[6]  Yaniv Erlich,et al.  Routes for breaching and protecting genetic privacy , 2013, Nature Reviews Genetics.

[7]  N. Cox,et al.  On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. , 2012, American journal of human genetics.

[8]  Brian L Browning,et al.  A One-Penny Imputed Genome from Next-Generation Reference Panels. , 2018, American journal of human genetics.

[9]  Stephen E. Fienberg,et al.  Scalable privacy-preserving data sharing methodology for genome-wide association studies , 2014, J. Biomed. Informatics.

[10]  Eran Halperin,et al.  Identifying Personal Genomes by Surname Inference , 2013, Science.

[11]  Maximillian S Westphal,et al.  SMaSH: Sample matching using SNPs in humans , 2019, BMC Genomics.

[12]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[13]  Clifford A. Meyer,et al.  Model-based Analysis of ChIP-Seq (MACS) , 2008, Genome Biology.

[14]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[15]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[16]  Ellen T. Gelfand,et al.  The Genotype-Tissue Expression (GTEx) project , 2013, Nature Genetics.

[17]  Raymond K. Auerbach,et al.  The real cost of sequencing: higher than you think! , 2011, Genome Biology.

[18]  Fidel Ramírez,et al.  deepTools2: a next generation web server for deep-sequencing data analysis , 2016, Nucleic Acids Res..

[19]  M. Gerstein,et al.  Quantification of private information leakage from phenotype-genotype data: linking attacks , 2016, Nature Methods.

[20]  Stephen E. Fienberg,et al.  Privacy Preserving GWAS Data Sharing , 2011, 2011 IEEE 11th International Conference on Data Mining Workshops.

[21]  J. Michael Cherry,et al.  ENCODE data at the ENCODE portal , 2015, Nucleic Acids Res..

[22]  Mark Gerstein,et al.  Analysis of sensitive information leakage in functional genomics signal profiles through genomic deletions , 2018, Nature Communications.

[23]  Vitaly Shmatikov,et al.  Robust De-anonymization of Large Sparse Datasets , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[24]  Peter J. Bickel,et al.  Measuring reproducibility of high-throughput experiments , 2011, 1110.4705.

[25]  E. Schadt The changing privacy landscape in the era of big data , 2012, Molecular systems biology.

[26]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[27]  Tao Huang,et al.  MODMatcher: Multi-Omics Data Matcher for Integrative Genomic Analysis , 2014, PLoS Comput. Biol..

[28]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[29]  K. Hao,et al.  Bayesian method to predict individual SNP genotypes from gene expression data , 2012, Nature Genetics.

[30]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[31]  B. Knoppers,et al.  Are Data Sharing and Privacy Protection Mutually Exclusive? , 2016, Cell.

[32]  Vitaly Shmatikov,et al.  Privacy-preserving data exploration in genome-wide association studies , 2013, KDD.

[33]  Bo Peng,et al.  Large-Scale Privacy-Preserving Mapping of Human Genomic Sequences on Hybrid Clouds , 2012, NDSS.

[34]  L. Sweeney Simple Demographics Often Identify People Uniquely , 2000 .

[35]  Neva C. Durand,et al.  A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping , 2014, Cell.

[36]  David A. Knowles,et al.  Transcriptome Sequencing of a Large Human Family Identifies the Impact of Rare Noncoding Variants , 2014, American journal of human genetics.

[37]  Pedro G. Ferreira,et al.  Transcriptome and genome sequencing uncovers functional variation in humans , 2013, Nature.

[38]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[39]  Michael D. Edge,et al.  Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci , 2018, Cell.

[40]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.