Finding Short DNA Motifs Using Permuted Markov Models

Many short DNA motifs, such as transcription factor binding sites (TFBS) and splice sites, exhibit strong local as well as nonlocal dependence. We introduce permuted variable length Markov models (PVLMM), which can capture the potentially important dependencies among positions, and apply them to the problem of detecting splice sites and TFBS. The models perform satisfactorily in terms of prediction and also give ready biological interpretations of the observed sequence dependence. The issue of model selection is also studied.
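To illustrate why permuting positions can help a Markov model capture nonlocal dependence, here is a minimal sketch under stated assumptions. It is not the PVLMM of the paper: it uses a fixed first-order chain rather than variable-length contexts, and the permutation is supplied by hand rather than selected by any model search. The function names `fit_permuted_markov` and `log_likelihood`, the pseudocount, and the toy sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def fit_permuted_markov(seqs, perm, pseudocount=1.0):
    """Fit a first-order Markov chain over positions visited in the order `perm`.

    Simplified illustration only: fixed order 1, permutation given by the caller.
    Returns model[k][prev][base], where k indexes the k-th visited position and
    prev is "" for the first visited position.
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in perm]
    for s in seqs:
        for k, pos in enumerate(perm):
            prev = s[perm[k - 1]] if k > 0 else ""
            counts[k][prev][s[pos]] += 1.0
    model = []
    for k in range(len(perm)):
        contexts = [""] if k == 0 else list(ALPHABET)
        table = {}
        for prev in contexts:
            total = sum(counts[k][prev][b] for b in ALPHABET)
            total += pseudocount * len(ALPHABET)
            # Smoothed conditional distribution of the base given the context.
            table[prev] = {
                b: (counts[k][prev][b] + pseudocount) / total for b in ALPHABET
            }
        model.append(table)
    return model

def log_likelihood(model, perm, seq):
    """Log-probability of `seq` under a model fitted with the same `perm`."""
    ll = 0.0
    for k, pos in enumerate(perm):
        prev = seq[perm[k - 1]] if k > 0 else ""
        ll += math.log(model[k][prev][seq[pos]])
    return ll

# Toy sites (made up): positions 0 and 3 always carry the same base,
# a nonlocal dependence that an unpermuted first-order chain cannot see.
sites = ["ACGA", "CAGC", "ATTA", "CGAC", "AGCA", "CTTC"]
identity = [0, 1, 2, 3]
permuted = [0, 3, 1, 2]  # visits position 3 right after position 0

model_id = fit_permuted_markov(sites, identity)
model_pm = fit_permuted_markov(sites, permuted)
ll_id = sum(log_likelihood(model_id, identity, s) for s in sites)
ll_pm = sum(log_likelihood(model_pm, permuted, s) for s in sites)
print(f"identity order: {ll_id:.3f}  permuted order: {ll_pm:.3f}")
```

On this toy data the permuted order attains the higher training log-likelihood, because the dependence between positions 0 and 3 becomes an adjacent-step dependence once the positions are reordered. Comparing training likelihoods this way says nothing about how a permutation or context length should be chosen in practice, which is the model selection question the abstract refers to.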
