Segmentation of yeast DNA using hidden Markov models

MOTIVATION: Compositionally homogeneous segments of genomic DNA often correspond to meaningful biological units. Simple sliding-window analysis is usually insufficient for compositional segmentation of natural sequences. Hidden Markov models (HMMs) with a small number of states are a natural language for describing the compositional properties of chromosome-size DNA sequences.

RESULTS: The segmentation algorithms were applied to chromosomes I, III, IV, VI and IX of the yeast Saccharomyces cerevisiae. The optimal number of HMM states was found to be four. The optimal four-state HMMs are very similar across all chromosomes, as are the reconstructed segmentations. In most cases the model with k + 1 states is obtained by 'splitting' one of the states of the k-state model, with a corresponding increase in the level of detail of the segmentation. The high-AT states usually correspond to intergenic regions. We also explore the likelihood landscape of the models and analyze the dynamics of the optimization process, thereby addressing the reliability of the obtained optima and the efficiency of the algorithms.
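To illustrate what HMM-based compositional segmentation means in practice, the following is a minimal sketch that decodes a toy sequence with a two-state HMM using the standard Viterbi algorithm in log space. It is not the implementation used in this work: the state names (`AT_rich`, `GC_rich`), the transition and emission probabilities, the toy sequence, and the helper names `viterbi_segment` and `to_segments` are all assumptions chosen for illustration, and no parameter fitting or model selection is performed.

```python
import math

def viterbi_segment(seq, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for seq (log-space Viterbi)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for a transition into s
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_segments(path):
    """Collapse a state path into (state, start, end) runs, end exclusive."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((path[start], start, i))
            start = i
    return segments

# Hypothetical two-state model; these numbers are illustrative assumptions,
# not parameters estimated from yeast chromosomes.
states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
emit_p = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

toy_seq = "ATATATATAT" + "GCGCGCGCGC"
path = viterbi_segment(toy_seq, states, start_p, trans_p, emit_p)
segments = to_segments(path)
print(segments)
```

Unlike a sliding window, the decoded path places a segment boundary only where the gain in emission likelihood outweighs the cost of a state transition, so no window size has to be chosen in advance.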
