Accurate filtering of privacy-sensitive information in raw genomic data.

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks unless it is protected to the highest standards. In this article, we follow the position that post-alignment privacy protection is not enough and argue that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot be extended to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive information (i.e., information more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows fine-grained, automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length without distinction, which makes it usable with any recent or future sequencing technology. The filter is accurate in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome, instead of 100,000 in previous works). It produces far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% at a 2% mutation rate). Finally, practical experiments demonstrate high performance in terms of both throughput and memory consumption.
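To make the idea of classifying raw reads concrete, the following is a minimal sketch and not the filter described in the article. It assumes that known sensitive sequences are available as plain strings, that matching is done on exact k-mers held in an in-memory set, and that a read is flagged as soon as one of its k-mers matches. The function names, the value of `k`, and the toy sequences are all hypothetical. Because the lookup slides over the read one k-mer at a time, the same code applies to reads of any length.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Iterable[str]:
    """Yield every length-k substring of seq (nothing if seq is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_dictionary(sensitive_sequences: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers of the known sensitive sequences (hypothetical input)."""
    return {km for s in sensitive_sequences for km in kmers(s.upper(), k)}


def sensitive_mask(read: str, dictionary: Set[str], k: int) -> List[bool]:
    """Mark each nucleotide of the read that is covered by a matching k-mer."""
    read = read.upper()
    mask = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in dictionary:
            for j in range(i, i + k):
                mask[j] = True
    return mask


def classify_read(read: str, dictionary: Set[str], k: int) -> str:
    """Label a read 'sensitive' if any nucleotide is marked, else 'non-sensitive'."""
    return "sensitive" if any(sensitive_mask(read, dictionary, k)) else "non-sensitive"


if __name__ == "__main__":
    k = 4  # illustrative value only
    dictionary = build_dictionary(["ACGTAC"], k)  # toy sensitive sequence
    for read in ["TTACGTTT", "GGGGGGGG", "ACG"]:
        print(read, classify_read(read, dictionary, k))
```

Exact matching of this kind does not tolerate sequencing errors, and an in-memory set would not scale to a whole-genome dictionary; a practical filter would need an error-tolerant and memory-efficient lookup structure.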
