A Gibbs sampler for the detection of subtle motifs in multiple sequences

We describe a statistically based algorithm that aligns protein sequences by means of predictive inference. Using residue frequencies, this Gibbs sampling algorithm iteratively selects alignments in accordance with their conditional probabilities. The newly formed alignments in turn update an evolving residue frequency model. When equilibrium is reached, the most probable alignment can be identified. If a detectable pattern is present, convergence is generally rapid. Effectively, the algorithm finds optimal local multiple alignments in linear time (seconds on current workstations). Its use is illustrated on test sets of lipocalins and prenyltransferases.
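The iteration described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes exactly one motif occurrence of fixed width per sequence, a background model estimated from the overall residue composition, and a uniform pseudocount, none of which are specified in the abstract. The function name `gibbs_motif_sampler` and its parameters (`width`, `n_iter`, `pseudo`, `seed`) are hypothetical, and every sequence is assumed to be at least `width` residues long.

```python
import math
import random
from collections import Counter


def gibbs_motif_sampler(seqs, width, n_iter=200, pseudo=1.0, seed=0):
    """Sample one motif start per sequence (illustrative sketch only)."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))

    # Background residue frequencies from all sequences (an assumption).
    bg_counts = Counter("".join(seqs))
    total = sum(bg_counts.values())
    bg = {a: bg_counts[a] / total for a in alphabet}

    # Start from random motif positions, one per sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(n_iter):
        for z, seq in enumerate(seqs):
            # Predictive update: build the residue frequency model
            # from every sequence except the one being realigned.
            cols = [Counter() for _ in range(width)]
            for k, s in enumerate(seqs):
                if k != z:
                    for j in range(width):
                        cols[j][s[pos[k] + j]] += 1
            denom = (len(seqs) - 1) + pseudo * len(alphabet)

            # Weight each candidate start by the ratio of its probability
            # under the motif model to its probability under the background.
            weights = []
            for start in range(len(seq) - width + 1):
                log_w = 0.0
                for j in range(width):
                    r = seq[start + j]
                    q = (cols[j][r] + pseudo) / denom
                    log_w += math.log(q / bg[r])
                weights.append(math.exp(log_w))

            # Sampling step: draw the new start in proportion to the weights.
            pos[z] = rng.choices(range(len(weights)), weights=weights)[0]

    return pos
```

Calling `gibbs_motif_sampler(["ACDEFGHIK", "LMACDEFNP", "QRSTACDEF"], width=5)` returns one start index per sequence. The sketch returns the last sampled positions; identifying the most probable alignment, as the abstract describes, would additionally require scoring the alignments visited and keeping the best one.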
