Paper Information - Efficient Updating of Biological Sequence Analyses

Efficient Updating of Biological Sequence Analyses

We present a novel approach for reducing the computational complexity of updating homologies produced by a wide class of popular state-of-the-art algorithms in comparative computational biology. The algorithms that we consider use hidden Markov models (HMMs) and a Viterbi recursion to evaluate matches between sequences, or between a sequence and a model. Such updates occur frequently in practice as researchers discover errors in biological sequences or analyze multiple highly similar sequences, e.g., a family of proteins that underwent mutations during evolution. The proposed algorithm interprets the Viterbi recursion as an update of an optimal minimum spanning tree in a shortest path problem. We propose the novel concept of a relative node tolerance bound and show how it can be used to guarantee that one or more partial subtrees of a minimum spanning tree obtained before encountering the perturbations remain optimal. We also describe how to compute and use the relative node tolerance bounds in real time to skip most unperturbed parts of a sequence while computing the new optimal solution. To further reduce the computational overhead of evaluating the tolerance bounds, we present and exploit a statistical analysis of the matching procedure that estimates how many columns of the dynamic program corresponding to the matching problem are affected by a change in a preceding column. The resulting "reusable" Viterbi decoding algorithm can update a matching result in less than one-third, and as little as one-fifth, of the time required to compute a new match with the normal matching procedure, i.e., running the Viterbi algorithm with the updated sequences against a base hidden Markov model.
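The shortest-path reading of the Viterbi recursion, and the idea of reusing an earlier decoding after a perturbation, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it does not compute relative node tolerance bounds. Instead it uses a simpler exact stopping rule, swapped in for illustration: recompute columns from the changed position onward and stop as soon as a new column equals the old one up to a constant offset, since from that point every later column only shifts by the same constant and keeps its backpointers. The function names (`step`, `viterbi`, `update`, `traceback`), the two-state model, and the restriction to a single substitution with finite costs are all assumptions made for this example.

```python
import math

def nlog(p):
    """Negative log-probability, so the most likely path is the cheapest path."""
    return -math.log(p)

def first_column(symbol, cost_init, cost_emit):
    n = len(cost_init)
    return [cost_init[s] + cost_emit[s][symbol] for s in range(n)], [-1] * n

def step(prev, symbol, cost_trans, cost_emit):
    """One Viterbi column: cheapest cost to reach each state, plus backpointers."""
    n = len(prev)
    col, ptr = [], []
    for s in range(n):
        best = min(range(n), key=lambda r: prev[r] + cost_trans[r][s])
        col.append(prev[best] + cost_trans[best][s] + cost_emit[s][symbol])
        ptr.append(best)
    return col, ptr

def viterbi(obs, cost_init, cost_trans, cost_emit):
    """Full decoding: returns all cost columns and backpointer columns."""
    col, ptr = first_column(obs[0], cost_init, cost_emit)
    cols, back = [col], [ptr]
    for symbol in obs[1:]:
        col, ptr = step(cols[-1], symbol, cost_trans, cost_emit)
        cols.append(col)
        back.append(ptr)
    return cols, back

def traceback(cols, back):
    """Follow backpointers from the cheapest final state."""
    s = min(range(len(cols[-1])), key=lambda r: cols[-1][r])
    path = [s]
    for t in range(len(cols) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

def update(new_obs, pos, cols, back, cost_init, cost_trans, cost_emit, tol=1e-9):
    """Re-decode after a substitution at `pos`, reusing columns before `pos`.

    Illustrative stopping rule (an assumption, not the paper's tolerance bound):
    stop once the new column equals the old column plus a constant offset.
    """
    cols = [c[:] for c in cols]
    back = [b[:] for b in back]
    recomputed = 0
    for t in range(pos, len(new_obs)):
        if t == 0:
            col, ptr = first_column(new_obs[0], cost_init, cost_emit)
        else:
            col, ptr = step(cols[t - 1], new_obs[t], cost_trans, cost_emit)
        recomputed += 1
        shift = col[0] - cols[t][0]  # offset relative to the old column
        converged = all(abs(col[s] - cols[t][s] - shift) <= tol
                        for s in range(len(col)))
        cols[t], back[t] = col, ptr
        if converged:
            # Later columns keep their backpointers; costs shift uniformly.
            for u in range(t + 1, len(new_obs)):
                cols[u] = [c + shift for c in cols[u]]
            break
    return cols, back, recomputed

# Hypothetical two-state, two-symbol model used only for this example.
cost_init = [nlog(0.5), nlog(0.5)]
cost_trans = [[nlog(0.8), nlog(0.2)], [nlog(0.3), nlog(0.7)]]
cost_emit = [[nlog(0.9), nlog(0.1)], [nlog(0.1), nlog(0.9)]]

old_obs = [0] * 10
new_obs = old_obs[:]
new_obs[2] = 1  # a single substitution, e.g. a corrected sequencing error

cols, back = viterbi(old_obs, cost_init, cost_trans, cost_emit)
new_cols, new_back, recomputed = update(
    new_obs, 2, cols, back, cost_init, cost_trans, cost_emit)
print(traceback(new_cols, new_back), "columns recomputed:", recomputed)
```

In this sketch the saving comes from skipping the quadratic-in-states `step` call for every column after the stopping point; the copy of the old columns is kept only for clarity and would be avoided in a real implementation. Insertions and deletions, which change the sequence length, are outside what this example handles.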

Ahmed H. Tewfik | Changjin Hong
