EFFICIENT DNA MOTIF DISCOVERY USING MODIFIED GENETIC ALGORITHM

In this study, a new genetic algorithm was developed to discover the best motifs in a set of DNA sequences. The main steps were: finding the potential positions in each sequence using a few voters (1–5 sequences), constructing the chromosomes from the potential positions, evaluating the fitness of each gene (position) and of each chromosome, calculating a new random distribution, and using that distribution to generate the next generation. To verify the effectiveness of the proposed algorithm, several real and artificial datasets were used, and the results were compared with those of the standard genetic algorithm and the Gibbs, MEME, and Consensus algorithms. Although all the algorithms showed low correlation with the correct motifs, the new algorithm exhibited higher accuracy without sacrificing implementation time.
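
The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the scoring or update rules, so the Hamming-distance voter score, the consensus-match fitness, the additive weight update, and every function and parameter name (`candidate_positions`, `gene_fitness`, `discover`, `n_voters`, `keep`, `pop_size`) are assumptions chosen only to make the workflow concrete.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_positions(seqs, width, n_voters=2, keep=5):
    """Assumed voter step: for each sequence, keep the `keep` start positions
    whose l-mers lie closest to the best-matching l-mer of each voter sequence.
    The first `n_voters` sequences act as voters; a voter does not score itself."""
    voters = seqs[:n_voters]
    result = []
    for idx, s in enumerate(seqs):
        def score(i, s=s, idx=idx):
            kmer = s[i:i + width]
            return sum(
                min(hamming(kmer, v[j:j + width])
                    for j in range(len(v) - width + 1))
                for vi, v in enumerate(voters) if vi != idx
            )
        result.append(sorted(range(len(s) - width + 1), key=score)[:keep])
    return result


def consensus(kmers):
    """Most frequent base in each column of the aligned l-mers."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*kmers))


def gene_fitness(seqs, chrom, width):
    """Assumed fitness: per gene, the number of bases that agree with the
    consensus of the l-mers selected by the chromosome."""
    kmers = [s[p:p + width] for s, p in zip(seqs, chrom)]
    cons = consensus(kmers)
    return [width - hamming(k, cons) for k in kmers]


def discover(seqs, width, pop_size=30, generations=20, seed=0):
    """Sample chromosomes from per-sequence position weights, then rebuild
    the weights from gene fitness to form the next generation's distribution."""
    rng = random.Random(seed)
    cands = candidate_positions(seqs, width)
    weights = [[1.0] * len(c) for c in cands]  # start from a uniform distribution
    best, best_fit = None, -1
    for _ in range(generations):
        population = [
            tuple(rng.choices(c, weights=w)[0] for c, w in zip(cands, weights))
            for _ in range(pop_size)
        ]
        # Smoothing of 1.0 keeps every candidate position reachable.
        new_weights = [[1.0] * len(c) for c in cands]
        for chrom in population:
            fits = gene_fitness(seqs, chrom, width)
            total = sum(fits)  # chromosome fitness = sum of its gene fitnesses
            if total > best_fit:
                best, best_fit = chrom, total
            for j, (pos, f) in enumerate(zip(chrom, fits)):
                new_weights[j][cands[j].index(pos)] += f
        weights = new_weights
    return best, best_fit
```

Calling `discover(seqs, width)` on a list of DNA strings returns one start position per sequence together with the summed fitness of that chromosome. Rebuilding the sampling weights from gene fitness, rather than applying crossover and mutation, is one plausible reading of "calculating a new random distribution"; the original paper should be consulted for the actual update rule.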
