Privacy-preserving genomic computation through program specialization

In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that takes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it is shared across humans (and hence public), and consequently relatively little of it is sensitive. Based on this observation, we propose a privacy-protection framework that partitions a genomic computation, distributing the part that operates on sensitive data to the data provider and the part that operates on public data to the user of the data. Such a partition is achieved through program specialization, which enables a biocomputing program to perform a concrete execution on public data and a symbolic execution on sensitive data. As a result, the program is simplified into an efficient query program that takes only sensitive genetic data as inputs. We prove the effectiveness of our techniques on a set of dynamic programming algorithms common in genomic computing. We develop a program transformation tool that automatically instruments a legacy program for specialization operations. We also demonstrate that our techniques can greatly facilitate secure multi-party computations on large biocomputing problems.
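To make the idea of specialization concrete, the following is a minimal Python sketch, not the paper's tool. It specializes an edit-distance dynamic program against a genome in which most positions are public characters and a few are sensitive placeholders. Cells that depend only on public data are computed concretely by the data user; cells that depend on a sensitive position are kept as symbolic expressions, forming a residual query that the data provider can evaluate on the sensitive values alone. The names `Sym`, `specialize`, and `evaluate`, and the tuple-based expression encoding, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """Hypothetical placeholder for one sensitive nucleotide."""
    name: str


def add(a, b):
    # Fold constants when both operands are concrete; otherwise stay symbolic.
    if isinstance(a, int) and isinstance(b, int):
        return a + b
    return ("add", a, b)


def smin(*args):
    # Concrete operands collapse to one constant; symbolic ones are kept.
    consts = [a for a in args if isinstance(a, int)]
    syms = [a for a in args if not isinstance(a, int)]
    if not syms:
        return min(consts)
    folded = ([min(consts)] if consts else []) + syms
    return ("min", *folded)


def specialize(query, genome):
    """Run the edit-distance recurrence, concretely where possible.

    `genome` is a list whose items are public characters (str) or Sym.
    Returns an int if nothing sensitive was touched, else a residual
    expression over the sensitive symbols.
    """
    n, m = len(query), len(genome)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = genome[j - 1]
            if isinstance(g, Sym):
                cost = ("neq", g.name, query[i - 1])  # symbolic mismatch test
            else:
                cost = int(query[i - 1] != g)         # concrete mismatch test
            D[i][j] = smin(add(D[i - 1][j], 1),
                           add(D[i][j - 1], 1),
                           add(D[i - 1][j - 1], cost))
    return D[n][m]


def evaluate(expr, env):
    """Data-provider side: evaluate the residual on sensitive values only."""
    if isinstance(expr, int):
        return expr
    op = expr[0]
    if op == "neq":
        return int(env[expr[1]] != expr[2])
    if op == "add":
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    if op == "min":
        return min(evaluate(e, env) for e in expr[1:])
    raise ValueError(f"unknown operator: {op}")


# Toy input: one sensitive position (s1) in an otherwise public genome.
residual = specialize("ACGT", ["A", "C", Sym("s1"), "T"])
print(evaluate(residual, {"s1": "G"}))  # 0
print(evaluate(residual, {"s1": "T"}))  # 1
```

In this toy setting the data user never sees the value bound to `s1`, and the data provider never needs the public part of the genome or the full alignment program. The sketch does not share or simplify sub-expressions, so the residual can grow quickly on realistic inputs; it illustrates only how the computation is partitioned.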
