Self-organizing neural networks to support the discovery of DNA-binding motifs

Identification of the short DNA sequence motifs that serve as binding targets for transcription factors is an important challenge in bioinformatics. Unsupervised techniques from the statistical learning theory literature have often been applied to motif discovery, but effective solutions for large genomic datasets have yet to be found. We present here three self-organizing neural networks that have applicability to the motif-finding problem. The core system in this study is a previously described SOM-based motif-finder named SOMBRERO. The motif-finder is integrated in this work with a SOM-based method that automatically constructs generalized models for structurally related motifs and initializes SOMBRERO with relevant biological knowledge. A self-organizing tree method that displays the relationships between various motifs is also presented, and it is shown that such a method can act as an effective structural classifier of novel motifs. The performance of the three self-organizing neural networks is evaluated here using various datasets.

[1]  Donald C. Wunsch,et al.  Application of the method of elastic maps in analysis of genetic texts , 2003, Proceedings of the International Joint Conference on Neural Networks, 2003..

[2]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[3]  S. Kanaya,et al.  A novel bioinformatic strategy for unveiling hidden genome signatures of eukaryotes: self-organizing map of oligonucleotide frequency. , 2002, Genome informatics. International Conference on Genome Informatics.

[4]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[5]  M. Tompa,et al.  Discovery of novel transcription factor binding sites by statistical overrepresentation. , 2002, Nucleic acids research.

[6]  Michael Q. Zhang,et al.  Similarity of position frequency matrices for transcription factor binding sites , 2005, Bioinform..

[7]  W. H. Day,et al.  Critical comparison of consensus methods for molecular sequences. , 1992, Nucleic acids research.

[8]  Aaron Golden,et al.  Transcription factor binding site identification using the self-organizing map , 2005, Bioinform..

[9]  Alexander N. Gorban,et al.  Seven clusters in genomic triplet distributions , 2003, Silico Biol..

[10]  Samuel Kaski,et al.  Grouping and visualizing human endogenous retroviruses by bootstrapping median self-organizing maps , 2004, 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[11]  Gary D. Stormo,et al.  Identifying target sites for cooperatively binding factors , 2001, Bioinform..

[12]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[13]  Ting Wang,et al.  Combining phylogenetic data with co-regulated genes to identify regulatory motifs , 2003, Bioinform..

[14]  Nir Friedman,et al.  Modeling dependencies in protein-DNA binding sites , 2003, RECOMB '03.

[15]  T. Abe,et al.  Direct cloning of genes encoding novel xylanases from the human gut. , 2005, Canadian journal of microbiology.

[16]  Mark J. Embrechts,et al.  DNA classifications with self-organizing maps (SOMs) , 2003, Proceedings of the 2003 IEEE International Workshop on Soft Computing in Industrial Applications, 2003. SMCia/03..

[17]  Jun S. Liu,et al.  Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model , 2003 .

[18]  S. Pietrokovski Searching databases of conserved sequence regions by aligning protein multiple-alignments. , 1996, Nucleic acids research.

[19]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[20]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[21]  Thomas Werner,et al.  MatInspector and beyond: promoter analysis based on transcription factor binding sites , 2005, Bioinform..

[22]  Wyeth W. Wasserman,et al.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles , 2004, Nucleic Acids Res..

[23]  G. Stormo,et al.  ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[24]  Haiying Wang,et al.  Interactive GSOM-Based Approaches for Improving Biomedical Pattern Discovery and Visualization , 2004, CIS.

[25]  João Aires-de-Sousa,et al.  Representation of DNA sequences with virtual potentials and their processing by (SEQREP) Kohonen self-organizing maps , 2003, Bioinform..

[26]  Patrizio Arrigo,et al.  Identification of a new motif on nucleic acid sequence data using Kohonen's self-organizing map , 1991, Comput. Appl. Biosci..

[27]  Holger Karas,et al.  TRANSFAC: a database on transcription factors and their DNA binding sites , 1996, Nucleic Acids Res..

[28]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[29]  D. Signorini,et al.  Neural networks , 1995, The Lancet.

[30]  James O. McInerney,et al.  Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models , 2004, BMC Bioinformatics.

[31]  G. Stormo,et al.  Additivity in protein-DNA interactions: how good an approximation is it? , 2002, Nucleic acids research.

[32]  R. Sokal,et al.  A QUANTITATIVE APPROACH TO A PROBLEM IN CLASSIFICATION† , 1957, Evolution; International Journal of Organic Evolution.

[33]  H. Douzono,et al.  A design method of DNA chips for SNP analysis using self-organizing maps , 2001, IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222).

[34]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[35]  J. Dopazo,et al.  Phylogenetic Reconstruction Using an Unsupervised Growing Neural Network That Adopts the Topology of a Phylogenetic Tree , 1997, Journal of Molecular Evolution.

[36]  Aaron Golden,et al.  Improved detection of DNA motifs using a self-organized clustering of familial binding profiles , 2005, ISMB.

[37]  Aris Floratos,et al.  Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in Bioinformatics 1998;14(2): 229] , 1998, Bioinform..

[38]  J. Badger,et al.  Analysis of codon usage patterns of bacterial genomes using the self-organizing map. , 2001, Molecular biology and evolution.

[39]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[40]  Shigehiko Kanaya,et al.  Informatics for unveiling hidden genome signatures. , 2003, Genome research.

[41]  H. Bussemaker,et al.  Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[42]  Terence P. Speed,et al.  Finding Short DNA Motifs Using Permuted Markov Models , 2005, J. Comput. Biol..

[43]  S. Kanaya,et al.  Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. , 2001, Gene.

[44]  Martin Vingron,et al.  T-Reg Comparator: an analysis tool for the comparison of position weight matrices , 2005, Nucleic Acids Res..

[45]  John C. Wootton,et al.  Discovering Simple Regions in Biological Sequences Associated with Scoring Schemes , 2003, J. Comput. Biol..

[46]  W. Fitch Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology , 1971 .

[47]  J. Felsenstein Maximum-likelihood estimation of evolutionary trees from continuous characters. , 1973, American journal of human genetics.

[48]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[49]  Samuel Kaski,et al.  Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups , 2005, Int. J. Neural Syst..

[50]  Bart De Moor,et al.  Computational detection of cis-regulatory modules , 2003, ECCB.

[51]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[52]  A. Sandelin,et al.  Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. , 2004, Journal of molecular biology.

[53]  Mona Singh,et al.  Comparative analysis of methods for representing and searching for transcription factor binding sites , 2004, Bioinform..