Securing Aggregate Queries for DNA Databases

This paper addresses the problem of sharing person-specific genomic sequences without violating the privacy of their data subjects to support large-scale biomedical research projects. The proposed method builds on the framework proposed by Kantarcioglu et al. [1] but extends the results in a number of ways. One improvement is that our scheme is deterministic, with zero probability of a wrong answer (as opposed to a low probability). We also provide a new operating point in the space-time tradeoff, by offering a scheme that is twice as fast as theirs but uses twice the storage space. This point is motivated by the fact that storage is cheaper than computation in current cloud computing pricing plans. Moreover, our encoding of the data makes it possible for us to handle a richer set of queries than exact matching between the query and each sequence of the database, including: (i) counting the number of matches between the query symbols and a sequence; (ii) logical OR matches where a query symbol is allowed to match a subset of the alphabet thereby making it possible to handle (as a special case) a “not equal to” requirement for a query symbol (e.g., “not a G”); (iii) support for the extended alphabet of nucleotide base codes that encompasses ambiguities in DNA sequences (this happens on the DNA sequence side instead of the query side); (iv) queries that specify the number of occurrences of each kind of symbol in the specified sequence positions (e.g., two ‘A’ and four ‘C’ and one ‘G’ and three ‘T’, occurring in any order in the query-specified sequence positions); (v) a threshold query whose answer is ‘yes’ if the number of matches exceeds a query-specified threshold (e.g., “7 or more matches out of the 15 query-specified positions”). (vi) For all query types, we can hide the answers from the decrypting server, so that only the client learns the answer. (vii) In all cases, the client deterministically learns only the query's answer, except for query type (v) where we quantify the (very small) statistical leakage to the client of the actual count.

[1]  Yihua Zhang,et al.  An Overview of Issues and Recent Developments in Cloud Computing and Storage Security , 2014 .

[2]  Pascal Paillier,et al.  Public-Key Cryptosystems Based on Composite Degree Residuosity Classes , 1999, EUROCRYPT.

[3]  David Chaum,et al.  Blind Signatures for Untraceable Payments , 1982, CRYPTO.

[4]  Claude Liébecq Biochemical nomenclature and related documents : a compendium , 1992 .

[5]  Ernest F. Brickell,et al.  Fast Exponentiation with Precomputation (Extended Abstract) , 1992, EUROCRYPT.

[6]  Taher ElGamal,et al.  A public key cyryptosystem and signature scheme based on discrete logarithms , 1985 .

[7]  Paul Helman,et al.  Protecting data privacy through hard-to-reverse negative databases , 2007, International Journal of Information Security.

[8]  Marina Blanton,et al.  Secure Outsourcing of DNA Searching via Finite Automata , 2010, DBSec.

[9]  Josh Benaloh,et al.  Dense Probabilistic Encryption , 1999 .

[10]  Matthew K. Franklin,et al.  Communication-Efficient Private Protocols for Longest Common Subsequence , 2009, CT-RSA.

[11]  Bradley Malin,et al.  How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems , 2004, J. Biomed. Informatics.

[12]  Adi Shamir,et al.  A method for obtaining digital signatures and public-key cryptosystems , 1978, CACM.

[13]  Markus Jakobsson,et al.  Cryptographic Approaches to Provacy in Forensic DNA Databases , 2000, Public Key Cryptography.

[14]  Doug Szajda,et al.  Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm , 2006, NDSS.

[15]  Keith B. Frikken Practical Private DNA String Searching and Matching through Efficient Oblivious Automata Evaluation , 2009, DBSec.

[16]  Mikhail J. Atallah,et al.  Secure outsourcing of sequence comparisons , 2004, International Journal of Information Security.

[17]  Stefan Katzenbeisser,et al.  Privacy preserving error resilient dna searching through oblivious automata , 2007, CCS '07.

[18]  Payman Mohassel,et al.  Longest common subsequence as private search , 2009, WPES '09.

[19]  Silvio Micali,et al.  Probabilistic Encryption , 1984, J. Comput. Syst. Sci..

[20]  P. Bohley,et al.  Biochemical Nomenclature and Related Documents: International Union of Biochemistry Pergamon; New York, Oxford, 1978 vi + 224 pages , 1980 .

[21]  Murat Kantarcioglu,et al.  A Cryptographic Approach to Securely Share and Query Genomic Sequences , 2008, IEEE Transactions on Information Technology in Biomedicine.

[22]  Stefan Katzenbeisser,et al.  Privacy-Preserving Matching of DNA Profiles , 2008, IACR Cryptol. ePrint Arch..

[23]  Chris Clifton,et al.  Updating outsourced anatomized private databases , 2013, EDBT '13.

[24]  Zhen Lin,et al.  Genomic Research and Human Subject Privacy , 2004, Science.

[25]  Latanya Sweeney,et al.  Identifying Participants in the Personal Genome Project by Name , 2013, ArXiv.

[26]  Mikhail J. Atallah,et al.  Secure and Efficient Outsourcing of Sequence Comparisons , 2012, ESORICS.