Protecting Genomic Privacy by a Sequence-Similarity Based Obfuscation Method

In the post-genomic era, personal DNA sequences are produced and collected at large scale for genetic medical diagnosis and new drug discovery, which poses serious challenges to the protection of personal genomic privacy. Existing genomic privacy-protection methods are either time-consuming or inaccurate. To tackle these problems, this paper proposes a sequence-similarity-based obfuscation method, IterMegaBLAST, for fast and reliable protection of personal genomic privacy. Specifically, given a sequence selected at random from a dataset of DNA sequences, we first use MegaBLAST to find its most similar sequence in the dataset. These two aligned sequences form a cluster, for which an obfuscated sequence is generated via a DNA generalization lattice scheme. These steps are repeated until every sequence in the dataset has been clustered and its obfuscated sequence has been generated. Experimental results on two benchmark datasets demonstrate that, at the same degree of anonymity, IterMegaBLAST significantly outperforms existing state-of-the-art approaches in both utility accuracy and time complexity.
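The iterative pairing described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a naive per-position identity score stands in for MegaBLAST, merging differing bases into IUPAC two-base ambiguity codes stands in for the DNA generalization lattice, and the input is an even number of gap-free, equal-length sequences over A, C, G, T. The names `similarity`, `generalize`, and `obfuscate` are hypothetical.

```python
import random

# IUPAC ambiguity codes for each pair of distinct bases.
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def similarity(a, b):
    """Assumed stand-in for MegaBLAST: fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def generalize(a, b):
    """Assumed stand-in for the lattice: keep shared bases, and replace
    each differing position with the IUPAC code covering both bases."""
    return "".join(
        x if x == y else IUPAC_PAIR[frozenset((x, y))]
        for x, y in zip(a, b)
    )

def obfuscate(sequences, seed=0):
    """Pair each randomly chosen sequence with its most similar remaining
    sequence and give both members of the pair the same obfuscated sequence."""
    if len(sequences) % 2 != 0:
        raise ValueError("this sketch assumes an even number of sequences")
    rng = random.Random(seed)
    remaining = dict(enumerate(sequences))
    result = {}
    while remaining:
        # Pick a random unclustered sequence as the query.
        i = rng.choice(sorted(remaining))
        query = remaining.pop(i)
        # Find its most similar unclustered sequence.
        j = max(remaining, key=lambda k: similarity(query, remaining[k]))
        partner = remaining.pop(j)
        # Both cluster members are published as the same obfuscated sequence.
        result[i] = result[j] = generalize(query, partner)
    return [result[k] for k in range(len(sequences))]

obfuscated = obfuscate(["ACGT", "ACGA", "TTTT", "TTTC"])
print(obfuscated)
```

Because each cluster here holds exactly two sequences, every published sequence is shared by two records. Real sequences would need an actual alignment step before generalization, which this sketch omits.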
