Paper information - A private DNA motif finding algorithm

A private DNA motif finding algorithm

With the increasing availability of genomic sequence data, numerous methods have been proposed for finding DNA motifs. The discovery of DNA motifs serves as a critical step in many biological applications. However, the privacy implications of DNA analysis are normally neglected in existing methods. In this work, we propose a private DNA motif finding algorithm in which a DNA owner's privacy is protected by a rigorous privacy model known as ε-differential privacy. This model provides provable privacy guarantees that are independent of adversaries' background knowledge. Our algorithm makes use of the n-gram model and is optimized for processing large-scale DNA sequences. We evaluate the performance of our algorithm over real-life genomic data and demonstrate the promise of integrating privacy into DNA motif finding.
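To make the combination of n-grams and ε-differential privacy concrete, the sketch below releases noisy n-gram support counts using the standard Laplace mechanism. It is an illustration only and is not the algorithm proposed in the paper, whose details the abstract does not give. The function names, the `max_len` truncation step, and the neighboring-dataset model (datasets differ by adding or removing one owner's sequence) are all assumptions made for this example.

```python
import itertools
import random
from collections import Counter

ALPHABET = "ACGT"


def ngram_support(sequences, n, max_len):
    """For each n-gram, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        seq = seq[:max_len]  # assumption: truncate to bound one owner's influence
        present = {seq[i:i + n] for i in range(len(seq) - n + 1)}
        support.update(present)
    return support


def laplace_noise(scale, rng):
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1); rescale it.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_ngram_support(sequences, n, epsilon, max_len, rng=None):
    """Hypothetical helper: Laplace-noised support for every possible n-gram."""
    if epsilon <= 0 or max_len < n:
        raise ValueError("need epsilon > 0 and max_len >= n")
    rng = rng or random.Random()
    support = ngram_support(sequences, n, max_len)
    # Adding or removing one truncated sequence changes at most this many
    # counts, each by 1, so this is the L1 sensitivity of the count vector.
    sensitivity = min(max_len - n + 1, len(ALPHABET) ** n)
    scale = sensitivity / epsilon
    noisy = {}
    # Noise is added to all 4^n n-grams, including those with zero support;
    # releasing only observed n-grams would leak which ones occur.
    for gram in map("".join, itertools.product(ALPHABET, repeat=n)):
        noisy[gram] = support.get(gram, 0) + laplace_noise(scale, rng)
    return noisy
```

A call such as `private_ngram_support(seqs, n=3, epsilon=1.0, max_len=200)` returns a dictionary of 64 noisy counts; n-grams with the largest noisy support are candidate motifs. Enumerating all 4^n n-grams is only practical for small n, which is one reason a method aimed at large-scale sequences would need a more careful design than this sketch.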
