Optimized mixed Markov models for motif identification

Background: Identifying functional elements, such as transcription factor binding sites, is a fundamental step in reconstructing gene regulatory networks. It remains challenging, largely because training samples are in limited supply.

Results: We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods that allow model complexity to be adjusted for different motifs. Compared with other leading methods, OMiMa can incorporate more than NNSplice's pairwise dependencies; it avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and it requires smaller training samples than the Maximum Entropy Model (MEM). In tests on both simulated and actual data (regulatory cis-elements and splice sites), OMiMa's performance was superior to that of the other leading methods in terms of prediction accuracy, required size of training data, or computational time. To our knowledge, OMiMa is the only motif-finding tool that incorporates automatic selection of the best model. OMiMa is freely available at [1].

Conclusion: Our optimized mixture of Markov models represents an alternative to existing methods for modeling dependent structures within a biological motif. The model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods.

[1]  Nir Friedman,et al.  Modeling dependencies in protein-DNA binding sites , 2003, RECOMB '03.

[2]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[3]  Christopher B. Burge,et al.  Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals , 2003, RECOMB '03.

[4]  S. Salzberg,et al.  Microbial gene identification using interpolated Markov models. , 1998, Nucleic acids research.

[5]  B. Birren,et al.  Sequencing and comparison of yeast species to identify genes and regulatory elements , 2003, Nature.

[6]  W. Wasserman,et al.  A predictive model for regulatory sequences directing liver-specific transcription. , 2001, Genome research.

[7]  C. Guthrie,et al.  An RNA switch at the 5' splice site requires ATP and the DEAD box protein Prp28p. , 1999, Molecular cell.

[8]  Simon Kasif,et al.  Modeling splice sites with Bayes networks , 2000, Bioinform..

[9]  P. Bucher,et al.  High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites , 2002, Nature Biotechnology.

[10]  Thangavel Alphonse Thanaraj,et al.  Prediction of Exact Boundaries of Exons , 2000, Briefings Bioinform..

[11]  R. Staden.  Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[12]  E. Wingender,et al.  MATCH: A tool for searching transcription factor binding sites in DNA sequences. , 2003, Nucleic acids research.

[13]  Alexander E. Kel,et al.  MATCH™: a tool for searching transcription factor binding sites in DNA sequences , 2003, Nucleic Acids Res..

[14]  K. Nandabalan,et al.  Mutations in U1 snRNA bypass the requirement for a cell type-specific RNA splicing factor , 1993, Cell.

[15]  Xin Chen,et al.  TRANSFAC: an integrated system for gene expression regulation , 2000, Nucleic Acids Res..

[16]  G. Crooks,et al.  WebLogo: a sequence logo generator. , 2004, Genome research.

[17]  G. Christian Overton,et al.  Oligonucleotide frequency matrices addressed to recognizing functional DNA sites , 1999, Bioinform..

[18]  Elmar Nöth,et al.  Interpolated markov chains for eukaryotic promoter recognition , 1999, Bioinform..

[19]  David Haussler,et al.  Improved splice site detection in Genie , 1997, RECOMB '97.

[20]  Terence P. Speed,et al.  Finding short DNA motifs using permuted markov models , 2004, RECOMB.

[21]  Sònia Casillas,et al.  Conservation of regulatory sequences and gene expression patterns in the disintegrating Drosophila Hox gene complex. , 2005, Genome research.

[22]  Pankaj Agarwal,et al.  Detecting non-adjoining correlations with signals in DNA , 1998, RECOMB '98.

[23]  G. Church,et al.  Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. , 2002, Nucleic acids research.

[24]  K. Lindblad-Toh,et al.  Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals , 2005, Nature.

[25]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[26]  M. Green,et al.  Mechanism for cryptic splice site activation during pre-mRNA splicing. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Terence P. Speed,et al.  Finding Short DNA Motifs Using Permuted Markov Models , 2005, J. Comput. Biol..

[28]  B. Matthews.  Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[29]  H. Akaike.  A new look at the statistical model identification , 1974.

[30]  Gil Ast,et al.  Comparative analysis detects dependencies among the 5' splice-site positions. , 2004, RNA.

[31]  P. Bühlmann,et al.  Variable Length Markov Chains: Methodology, Computing, and Software , 2004.

[32]  Gary D. Stormo,et al.  SAMIE: Statistical Algorithm for Modeling Interaction Energies , 2000, Pacific Symposium on Biocomputing.

[33]  Matthew W. Hahn,et al.  The evolution of transcriptional regulation in eukaryotes. , 2003, Molecular biology and evolution.

[34]  Qing Zhou,et al.  Modeling within-motif dependence for transcription factor binding site predictions , 2004, Bioinform..

[35]  Tao Jiang,et al.  Identifying transcription factor binding sites through Markov chain optimization , 2002, ECCB.

[36]  W. A. Scaringe,et al.  Reported in vivo splice‐site mutations in the factor IX gene: Severity of splicing defects and a hypothesis for predicting deleterious splice donor mutations , 1999, Human mutation.

[37]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[38]  G. Schwarz.  Estimating the Dimension of a Model , 1978.

[39]  G. Stormo,et al.  Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. , 2001, Nucleic acids research.

[40]  T. Werner,et al.  MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. , 1995, Nucleic acids research.

[41]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[42]  D. S. Fields,et al.  Specificity, free energy and information content in protein-DNA interactions. , 1998, Trends in biochemical sciences.

[43]  Jorma Rissanen,et al.  Complexity of strings in the class of Markov sources , 1986, IEEE Trans. Inf. Theory.

[44]  Sorin Istrail,et al.  Proceedings of the second annual international conference on Computational molecular biology , 1998, RECOMB 1998.