Finding the biologically optimal alignment of multiple sequences

OBJECTIVE Deterministic annealing, which is derived from statistical physics, is a method for obtaining the global optimum in parameter space. During the annealing process, starting from high temperatures which are then lowered, deterministic annealing deterministically find the (global) optimum at each temperature. Thus, deterministic annealing is expected to be more computationally efficient than stochastic sampling strategies to obtain the global optimum. We propose to apply the deterministic annealing technique to the problem of efficiently finding the biologically optimal alignment of multiple sequences. METHODS AND MATERIAL We take a strategy based on probabilistic models for aligning multiple sequences. That is, we train a probabilistic model using given training sequences and obtain their alignment by parsing, i.e. searching for the most likely parse of each sequence and gaps using the trained parameters of the model. In this scenario, we propose a new stochastic model, which is simple enough to be suited to multiple sequence alignment and, unlike existing stochastic models, say a profile hidden Markov model (HMM), allows us to use similarity scores between symbols (or a symbol and a gap). We further present a learning algorithm for our simple model by combining deterministic annealing with an expectation-maximization (EM) algorithm. We emphasize that our approach is time-efficient, even if the training is done through an annealing process. RESULTS In our experiments, we used actual protein sequences whose three-dimensional (3D) structures are determined and which are all aligned based on their 3D structures. We compared the results obtained by our approach with those by other existing approaches. Experimental results clearly showed that our approach gave the best performance, in terms of the similarity to the structurally determined alignment, among the approaches tested. Experimental results further indicated that our approach was ten times more efficient in terms of actual computation time than a competing method.

[1]  Gregory R. Grant,et al.  Statistical Methods in Bioinformatics , 2001 .

[2]  Bin Ma,et al.  Near optimal multiple alignment within a band in polynomial time , 2000, STOC '00.

[3]  Dani Gamerman,et al.  Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference , 1997 .

[4]  M. O. Dayhoff,et al.  Atlas of protein sequence and structure , 1965 .

[5]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[6]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[7]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[8]  Eric C. Rouchka,et al.  Gibbs Recursive Sampler: finding transcription factor binding sites , 2003, Nucleic Acids Res..

[9]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[10]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[11]  Hugh B Nicholas,et al.  Strategies for multiple sequence alignment. , 2002, BioTechniques.

[12]  Bradley P. Carlin,et al.  Markov Chain Monte Carlo in Practice: A Roundtable Discussion , 1998 .

[13]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[14]  Naonori Ueda,et al.  Deterministic annealing EM algorithm , 1998, Neural Networks.

[15]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[16]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[17]  L. Holm,et al.  Exhaustive enumeration of protein domain families. , 2003, Journal of molecular biology.

[18]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[19]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.