A High-Throughput Method to Detect Privacy-Sensitive Human Genomic Data

Balancing privacy protection against data sharing is one of the main challenges in managing human genomic data. Novel privacy-enhancing technologies are required to address the known disclosure threats to sensitive personal genomic data without precluding data sharing. In this paper, we propose a method that systematically detects privacy-sensitive DNA segments directly in an input stream, using as reference a knowledge database of known privacy-sensitive nucleic and amino acid sequences. We show that adding our detection method to standard security techniques provides a robust, efficient privacy-preserving solution that neutralizes threats related to recently published attacks on genome privacy based on short tandem repeats, disease-related genes, and genomic variations. Current global knowledge of human genomes is sufficient to build a comprehensive database immediately, and this database can evolve automatically to address future attacks as new privacy-sensitive sequences are identified. Additionally, we validate that the detection method can be fitted inline with the next-generation sequencing (NGS) production cycle by using Bloom filters and scaling out to keep pace with faster sequencing machines.
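To make the detection idea concrete, the following is a minimal sketch of how a Bloom filter could flag reads that contain known privacy-sensitive subsequences. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the class and function names (`BloomFilter`, `build_filter`, `is_sensitive`), the k-mer length, the filter size, the number of hash functions, and the example sequences are all hypothetical, and the sketch handles nucleotide strings only.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative; sizes are assumptions)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_filter(sensitive_sequences, k, num_bits=1 << 20, num_hashes=4):
    """Index all k-mers of the (hypothetical) knowledge database."""
    bloom = BloomFilter(num_bits, num_hashes)
    for seq in sensitive_sequences:
        for kmer in kmers(seq.upper(), k):
            bloom.add(kmer)
    return bloom


def is_sensitive(read, bloom, k):
    """Flag a read if any of its k-mers may be in the database."""
    return any(kmer in bloom for kmer in kmers(read.upper(), k))


# Hypothetical database entry: a short tandem repeat of the unit GATA.
K = 12
database = ["GATAGATAGATAGATAGATA"]
bloom = build_filter(database, K)

read_with_repeat = "ACGT" + "GATAGATAGATAGATA" + "TTGC"
read_without_repeat = "ACGTTGCAACGGTTCCAAGG"
flagged = is_sensitive(read_with_repeat, bloom, K)
clean = is_sensitive(read_without_repeat, bloom, K)
```

Because a Bloom filter admits false positives but no false negatives, a design of this kind errs on the side of over-flagging: a read containing an indexed k-mer is always detected, while a small fraction of harmless reads may be flagged as well. The filter size and number of hash functions control that fraction.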
