Self-Organizing Maps of Position Weight Matrices for Motif Discovery in Biological Sequences

The identification of overrepresented motifs in a collection of biological sequences continues to be a relevant and challenging problem in computational biology. Currently popular methods of motif discovery are based on statistical learning theory. In this paper, a machine-learning approach to the motif discovery problem is explored. The approach is based on a Self-Organizing Map (SOM) where the output layer neuron weight vectors are replaced by position weight matrices. This approach can be used to characterise features present in a set of sequences, and thus can be used as an aid in overrepresented motif discovery. The SOM approach to motif discovery is demonstrated using biological sequence datasets, both real and simulated

[1]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[2]  John C. Wootton,et al.  Discovering Simple Regions in Biological Sequences Associated with Scoring Schemes , 2003, J. Comput. Biol..

[3]  Jun S. Liu,et al.  Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model , 2003 .

[4]  S. Kanaya,et al.  Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. , 2001, Gene.

[5]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[6]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[7]  Shigehiko Kanaya,et al.  Informatics for unveiling hidden genome signatures. , 2003, Genome research.

[8]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[9]  Aris Floratos,et al.  Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm [published erratum appears in Bioinformatics 1998;14(2): 229] , 1998, Bioinform..

[10]  J. Badger,et al.  Analysis of codon usage patterns of bacterial genomes using the self-organizing map. , 2001, Molecular biology and evolution.

[11]  James O. McInerney,et al.  Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models , 2004, BMC Bioinformatics.

[12]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[13]  M. Tompa,et al.  Discovery of novel transcription factor binding sites by statistical overrepresentation. , 2002, Nucleic acids research.

[14]  Panu Somervuo,et al.  How to make large self-organizing maps for nonvectorial data , 2002, Neural Networks.

[15]  Kuo-Chen Chou,et al.  Mining Biological Data Using Self-Organizing Map , 2003, J. Chem. Inf. Comput. Sci..

[16]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[17]  Alexander E. Kel,et al.  TRANSFAC®: transcriptional regulation, from patterns to profiles , 2003, Nucleic Acids Res..

[18]  H. Bussemaker,et al.  Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .