Efficient Updating of Biological Sequence Analyses

We present a novel approach for reducing the computational complexity of updating homologies produced by a wide class of popular state-of-the-art algorithms in comparative computational biology. The algorithms that we consider use hidden Markov models (HMMs) and a Viterbi recursion to evaluate matches between sequences, or between a sequence and a model. Such updates occur frequently in practice, as researchers discover errors in biological sequences or analyze multiple closely related sequences, e.g., a family of proteins that underwent mutations during evolution. The proposed algorithm interprets the Viterbi recursion as an update of an optimal minimum spanning tree in a shortest path problem. We propose the novel concept of a relative node tolerance bound and show how it can be used to guarantee that one or more partial subtrees of a minimum spanning tree obtained before encountering the perturbations remain optimal. We also describe how to compute and use the relative node tolerance bounds in real time to skip most unperturbed parts of a sequence while computing the new optimal solution. To further reduce the computational overhead of evaluating the tolerance bounds, we present and exploit a statistical analysis of the matching procedure that estimates how many columns of the dynamic program for the matching problem are affected by a change in a preceding column. The resulting "reusable" Viterbi decoding algorithm can update a matching result in less than one third, down to about one fifth, of the time required to compute a new match from scratch, i.e., by running the Viterbi algorithm with the updated sequences against the base hidden Markov model.
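
To make the shortest-path reading of the Viterbi recursion concrete, the following minimal Python sketch runs Viterbi on negative log-probabilities, so that each trellis column holds shortest-path costs, and then updates the result after a single-symbol change. It does not implement the relative node tolerance bound described above. Instead it assumes a simpler stopping rule: recomputation stops as soon as a recomputed column differs from the stored one by the same constant in every state, because all later backpointers are then unchanged. The two-state model, its parameter values, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical two-state HMM over DNA symbols (illustrative values only).
STATES = ("M", "B")
START = {"M": 0.5, "B": 0.5}
TRANS = {"M": {"M": 0.9, "B": 0.1}, "B": {"M": 0.2, "B": 0.8}}
EMIT = {"M": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}


def first_column(symbol):
    """Costs (negative log-probabilities) of starting in each state."""
    col = {s: -math.log(START[s]) - math.log(EMIT[s][symbol]) for s in STATES}
    return col, {s: None for s in STATES}


def relax_column(prev, symbol):
    """One Viterbi step in min-plus form: cheapest way into each state."""
    col, ptr = {}, {}
    for s in STATES:
        best = min(STATES, key=lambda r: prev[r] - math.log(TRANS[r][s]))
        col[s] = prev[best] - math.log(TRANS[best][s]) - math.log(EMIT[s][symbol])
        ptr[s] = best
    return col, ptr


def viterbi(obs):
    """Return the cost columns and backpointer columns of the full trellis."""
    cols = [first_column(obs[0])]
    for symbol in obs[1:]:
        cols.append(relax_column(cols[-1][0], symbol))
    return [c for c, _ in cols], [p for _, p in cols]


def traceback(cost, back):
    """Follow backpointers from the cheapest final state."""
    state = min(cost[-1], key=cost[-1].get)
    path = [state]
    for t in range(len(cost) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]


def update(obs, pos, new_symbol, cost, back, tol=1e-9):
    """Recompute columns from `pos` onward, reusing the stored trellis.

    Assumed stopping rule (not the paper's tolerance bound): once the new
    column equals the old one plus a constant (within `tol`), later columns
    keep their backpointers and only shift by that constant.
    """
    obs = list(obs)
    obs[pos] = new_symbol
    cost, back = list(cost), list(back)  # shallow copies; inputs stay intact
    recomputed = 0
    for t in range(pos, len(obs)):
        if t == 0:
            col, ptr = first_column(obs[0])
        else:
            col, ptr = relax_column(cost[t - 1], obs[t])
        shifts = [col[s] - cost[t][s] for s in STATES]
        cost[t], back[t] = col, ptr
        recomputed += 1
        if max(shifts) - min(shifts) <= tol:
            for u in range(t + 1, len(obs)):
                cost[u] = {s: c + shifts[0] for s, c in cost[u].items()}
            break
    return cost, back, recomputed
```

A caller would run `viterbi` once on the original sequence, keep the returned columns, and pass them to `update` whenever a symbol is corrected; the returned `recomputed` count shows how many columns were actually relaxed again. How quickly the stopping rule fires depends on the model and the sequence, so this sketch makes no claim about the speedups reported above.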
