On the use of algorithms to discover motifs in DNA sequences

Many approaches are currently devoted to finding DNA motifs in nucleotide sequences. However, this task remains challenging for specialists because gene regulatory mechanisms are still difficult to understand in depth, especially when analyzing binding sites in DNA. These sites, which are specific nucleotide sequences, are known to be responsible for transcription processes. This work therefore provides an updated overview of the strategies developed to discover meaningful motifs in DNA-related sequences and, in particular, of their attempts to identify relevant binding sites. Among the existing approaches, this work focuses on dictionary-based, ensemble, and artificial intelligence-based algorithms, since they represent the classical and the leading ones.
