A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy

Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to keep concealed. The question then arises: how can an entity transact in full DNA sequence data while concealing certain sensitive pieces of information in the genome sequence and still maintaining the utility of the DNA data? In response to this question, we propose a codon frequency obfuscation heuristic in which the codon frequency values of highly expressed genes are redistributed within the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results suggest that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence.
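The idea of redistributing codon frequencies within an amino acid group can be illustrated with a minimal Python sketch. It is not the procedure evaluated in the paper, whose details the abstract does not give. It assumes an uppercase, in-frame coding sequence over A, C, G, T and the standard genetic code, and it uses one simple, hypothetical redistribution rule: within each amino acid group, the observed synonymous codons exchange usage counts in reverse rank order, so the most frequent codon takes the place of the least frequent one. The function names `obfuscate`, `translate`, and `similarity` are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split a sequence into complete codons, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(seq):
    """Translate a DNA sequence into a one-letter amino acid string."""
    return "".join(CODON_TABLE[c] for c in split_codons(seq))


def obfuscate(seq):
    """Swap synonymous codons so usage counts are reversed within each group.

    Assumption: this reverse-rank swap is a stand-in rule for illustration,
    not the redistribution used by the authors.
    """
    codons = split_codons(seq)
    counts = Counter(codons)

    # Group the observed codons by the amino acid they encode.
    groups = {}
    for codon in counts:
        groups.setdefault(CODON_TABLE[codon], []).append(codon)

    # Within each group, map the k-th most frequent codon to the k-th least frequent.
    mapping = {}
    for members in groups.values():
        ranked = sorted(members, key=lambda c: (-counts[c], c))
        mapping.update(zip(ranked, reversed(ranked)))

    # Any trailing partial codon is copied through unchanged.
    return "".join(mapping[c] for c in codons) + seq[len(codons) * 3:]


def similarity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


original = "ATGGCTGCTGCCAAATAA"
obfuscated = obfuscate(original)
print(obfuscated)
print(translate(original) == translate(obfuscated))
print(round(similarity(original, obfuscated), 3))
```

Because only synonymous codons are exchanged, the encoded protein is unchanged while the nucleotide sequence and its codon usage profile differ. The sketch does not target a chosen similarity level or restrict itself to highly expressed genes. Both would need additional logic, for example limiting which groups or positions are swapped.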
