HMM sampling and applications to gene finding and alternative splicing

The standard method of applying hidden Markov models (HMMs) to biological problems is to find a Viterbi (maximal weight) path through the HMM graph. The Viterbi algorithm reduces the problem of finding the most likely hidden state sequence that explains the given observations to a dynamic programming problem on a corresponding directed acyclic graph. For example, in the gene finding application, the HMM is used to find the most likely underlying gene structure given a DNA sequence. In this note we discuss applications of sampling methods for HMMs. The standard sampling algorithm for HMMs is a variant of the common forward-backward and backtrack algorithms, and has already been applied in the context of Gibbs sampling methods. Nevertheless, the practice of sampling state paths from HMMs does not seem to have been widely adopted, and important applications have been overlooked. We show how sampling can be used to find alternative splicings for genes, including alternative splicings that are conserved between genes from related organisms. We also show how sampling from the posterior distribution is a natural way to compute the probability that a predicted exon or gene structure is correct under the assumed model. Finally, we describe a new memory-efficient sampling algorithm for certain classes of HMMs, which provides a practical sampling alternative to the Hirschberg algorithm for optimal alignment. The ideas presented have applications not only to gene finding and HMMs but more generally to stochastic context-free grammars and RNA structure prediction.
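The sampling procedure described above, a forward pass followed by a randomized backtrack, can be illustrated with a minimal Python sketch. It assumes a small discrete HMM given as dense lists (initial, transition and emission probabilities) and an observation sequence with nonzero probability under the model. The function names `forward`, `sample_path`, `sample_paths` and `state_frequency` are hypothetical and are not taken from the paper, and the sketch is the textbook full-memory version, not the memory-efficient algorithm the abstract mentions.

```python
import random


def forward(obs, init, trans, emit):
    """Scaled forward table: f[t][k] is proportional to P(x_1..x_t, state_t = k).

    Each row is normalized to sum to 1 to avoid underflow; the sampling step
    below only needs the values within a row up to a common factor.
    Assumes the observation sequence has nonzero probability under the model.
    """
    n = len(init)
    row = [init[k] * emit[k][obs[0]] for k in range(n)]
    total = sum(row)
    f = [[v / total for v in row]]
    for t in range(1, len(obs)):
        prev = f[-1]
        row = [
            emit[k][obs[t]] * sum(prev[j] * trans[j][k] for j in range(n))
            for k in range(n)
        ]
        total = sum(row)
        f.append([v / total for v in row])
    return f


def sample_path(obs, init, trans, emit, rng, f=None):
    """Draw one state path from the posterior P(path | obs).

    Where the Viterbi backtrack would take the maximizing predecessor,
    this backtrack chooses a predecessor at random, weighted by
    f[t][j] * trans[j][next_state].
    """
    if f is None:
        f = forward(obs, init, trans, emit)
    n, length = len(init), len(obs)
    path = [0] * length
    path[-1] = rng.choices(range(n), weights=f[-1])[0]
    for t in range(length - 2, -1, -1):
        nxt = path[t + 1]
        weights = [f[t][j] * trans[j][nxt] for j in range(n)]
        path[t] = rng.choices(range(n), weights=weights)[0]
    return path


def sample_paths(obs, init, trans, emit, count, rng):
    """Draw `count` independent paths, reusing one forward table."""
    f = forward(obs, init, trans, emit)
    return [sample_path(obs, init, trans, emit, rng, f) for _ in range(count)]


def state_frequency(paths, position, state):
    """Fraction of sampled paths that are in `state` at `position`.

    This is a Monte Carlo estimate of the posterior probability of that
    state at that position under the assumed model.
    """
    return sum(1 for p in paths if p[position] == state) / len(paths)
```

With a toy two-state model (say, state 0 for non-coding and state 1 for coding, an assumption made only for illustration), `sample_paths` yields a collection of alternative state paths, and `state_frequency` estimates how often a given position is labeled coding. The same counting idea extends to whole features such as exons by counting the sampled paths that contain the feature.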
