A private DNA motif finding algorithm

With the increasing availability of genomic sequence data, numerous methods have been proposed for finding DNA motifs. The discovery of DNA motifs serves as a critical step in many biological applications. However, the privacy implications of DNA analysis are normally neglected in existing methods. In this work, we propose a private DNA motif finding algorithm in which a DNA owner's privacy is protected by a rigorous privacy model known as ε-differential privacy. This model provides provable privacy guarantees that are independent of an adversary's background knowledge. Our algorithm makes use of the n-gram model and is optimized for processing large-scale DNA sequences. We evaluate the performance of our algorithm on real-life genomic data and demonstrate the promise of integrating privacy into DNA motif finding.
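To make the combination of n-grams and ε-differential privacy concrete, the following is a minimal illustrative sketch and not the algorithm proposed in this work, whose details are not given here. It assumes that each DNA owner contributes one sequence, that sequences are truncated to a hypothetical `max_len` so that a single owner affects at most `max_len - n + 1` counts, and that the standard Laplace mechanism is applied to the per-sequence support of every possible n-gram over the alphabet `ACGT`. All function and parameter names are assumptions made for illustration.

```python
import itertools
import random
from collections import Counter

ALPHABET = "ACGT"


def ngram_support(sequences, n, max_len):
    """For each n-gram, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        seq = seq[:max_len]  # truncation bounds one owner's influence (assumption)
        grams = {seq[i:i + n] for i in range(len(seq) - n + 1)}
        support.update(grams)
    return support


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_ngrams(sequences, n, max_len, epsilon, k, rng=None):
    """Return the k n-grams with the largest Laplace-noised support."""
    rng = rng or random.Random()
    support = ngram_support(sequences, n, max_len)
    # Adding or removing one sequence changes at most (max_len - n + 1) counts by 1.
    sensitivity = max(1, max_len - n + 1)
    scale = sensitivity / epsilon
    noisy = {}
    # Noise is added to every n-gram in the domain, including those with zero support.
    for gram in map("".join, itertools.product(ALPHABET, repeat=n)):
        noisy[gram] = support.get(gram, 0) + laplace(scale, rng)
    # Ranking noisy counts is post-processing and consumes no extra privacy budget.
    return sorted(noisy.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    data = ["ACGTAC", "ACGG", "TTACGA"]
    for gram, score in noisy_top_ngrams(data, n=2, max_len=10, epsilon=1.0, k=3):
        print(gram, round(score, 2))
```

Enumerating all 4^n candidates is only practical for small n, and floating-point Laplace sampling is known to have subtle weaknesses, so this sketch is suited to explaining the idea rather than to protecting real genomic data.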
