DNA Motif Finding Method Without Protection Can Leak User Privacy

DNA sequence analysis plays an important role in the study of gene regulatory networks. DNA motif finding, which mines key gene sequences associated with disease mechanisms and important biological functions, has become a key discipline in the post-genomic era and an increasingly active research area. However, research on DNA motif finding faces a serious privacy disclosure problem. Current DNA motif finding techniques do not manage and use data under well-controlled conditions, and the mining process itself is prone to revealing private information such as individual traits, characteristics, and disease defects. In this paper, we present an overview of privacy breaches in DNA motif finding, summarize the main current methods and tools, analyze their privacy risks, and use two case studies to verify that DNA motif finding can identify private information about individuals. Finally, we discuss privacy protection methods for motif finding and propose privacy protection solutions.
