GANN: Genetic algorithm neural networks for the detection of conserved combinations of features in DNA

Background: The multitude of motif detection algorithms developed to date have largely focused on the detection of patterns in primary sequence. Since sequence-dependent DNA structure and flexibility may also play a role in protein-DNA interactions, the simultaneous exploration of sequence- and structure-based hypotheses about the composition of binding sites and the ordering of features in a regulatory region should be considered as well. The consideration of structural features requires the development of new detection tools that can deal with data types other than primary sequence.

Results: GANN (available at http://bioinformatics.org.au/gann) is a machine learning tool for the detection of conserved features in DNA. The software suite contains programs to extract different regions of genomic DNA from flat files and convert these sequences to indices that reflect sequence and structural composition or the presence of specific protein binding sites. The machine learning component allows the classification of different types of sequences based on subsamples of these indices, and can identify the best combinations of indices and machine learning architecture for sequence discrimination. Another key feature of GANN is the replicated splitting of data into training and test sets, and the implementation of negative controls. In validation experiments, GANN successfully merged important sequence and structural features to yield good predictive models for synthetic and real regulatory regions.

Conclusion: GANN is a flexible tool that can search through large sets of sequence and structural feature combinations to identify those that best characterize a set of sequences.
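The following Python sketch illustrates two ideas mentioned above: converting sequences to numeric indices, and evaluating a classifier over replicated training/test splits alongside a label-shuffled negative control. It is not GANN's code. The dinucleotide scale `TOY_SCALE`, the synthetic sequences, the function names, and the simple threshold classifier (used here in place of a neural network and genetic algorithm) are all assumptions made for illustration.

```python
import random
from statistics import mean

# Hypothetical dinucleotide scale: 1.0 for steps made only of A/T, else 0.0.
# Illustrative values only; this is not a published structural scale.
TOY_SCALE = {a + b: (1.0 if a in "AT" and b in "AT" else 0.0)
             for a in "ACGT" for b in "ACGT"}


def to_index(seq, scale=TOY_SCALE):
    """Convert a DNA sequence to one scale value per dinucleotide step."""
    return [scale[seq[i:i + 2]] for i in range(len(seq) - 1)]


def feature(seq):
    """Summarise a sequence by the mean of its index values."""
    return mean(to_index(seq))


def split_accuracy(xs, ys, rng, test_fraction=0.3):
    """Fit a midpoint threshold on a random training split; score the test split."""
    order = list(range(len(xs)))
    rng.shuffle(order)
    n_test = max(1, int(len(order) * test_fraction))
    test, train = order[:n_test], order[n_test:]
    pos = [xs[i] for i in train if ys[i] == 1]
    neg = [xs[i] for i in train if ys[i] == 0]
    threshold = (mean(pos) + mean(neg)) / 2
    sign = 1 if mean(pos) >= mean(neg) else -1
    correct = sum((sign * (xs[i] - threshold) > 0) == (ys[i] == 1) for i in test)
    return correct / len(test)


def replicated_accuracy(xs, ys, replicates=50, seed=0, shuffle_labels=False):
    """Mean test accuracy over many random splits.

    With shuffle_labels=True the class labels are permuted before each split,
    which serves as a negative control: accuracy should fall to chance level.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(replicates):
        labels = list(ys)
        if shuffle_labels:
            rng.shuffle(labels)
        scores.append(split_accuracy(xs, labels, rng))
    return mean(scores)


def random_seq(rng, length, at_prob):
    """Synthetic sequence: each base is A/T with probability at_prob, else G/C."""
    return "".join(rng.choice("AT") if rng.random() < at_prob else rng.choice("GC")
                   for _ in range(length))


data_rng = random.Random(1)
sequences = ([random_seq(data_rng, 40, 0.8) for _ in range(30)] +
             [random_seq(data_rng, 40, 0.2) for _ in range(30)])
labels = [1] * 30 + [0] * 30
xs = [feature(s) for s in sequences]

real = replicated_accuracy(xs, labels)
control = replicated_accuracy(xs, labels, shuffle_labels=True)
print(f"mean test accuracy: {real:.2f}; shuffled-label control: {control:.2f}")
```

Comparing the accuracy on real labels with the shuffled-label control indicates whether the model has learned a genuine signal or is merely fitting noise; averaging over many splits reduces the dependence of the estimate on any single partition of the data.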
