Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.
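The iterative sampling idea described above can be illustrated with a minimal sketch of a Gibbs "site sampler". This is not the authors' implementation: it assumes exactly one motif occurrence per sequence, a fixed motif width, a DNA alphabet, and a uniform pseudocount, and it leaves out the detection of multiple patterns and pattern repeats mentioned in the abstract. The function name `gibbs_site_sampler` and all of its parameters are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def gibbs_site_sampler(seqs, width, sweeps=100, alphabet="ACGT",
                       pseudo=1.0, seed=0):
    """Return one sampled motif start per sequence (illustrative sketch).

    Assumes every sequence is at least `width` long and contains only
    letters from `alphabet`.
    """
    rng = random.Random(seed)
    n = len(seqs)

    # Background letter frequencies, smoothed with pseudocounts.
    letter_counts = Counter("".join(seqs))
    total = sum(letter_counts[a] for a in alphabet) + pseudo * len(alphabet)
    background = {a: (letter_counts[a] + pseudo) / total for a in alphabet}

    # Random initial alignment: one start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    # Per-column letter counts of the currently aligned sites.
    counts = [dict.fromkeys(alphabet, 0) for _ in range(width)]

    def update(seq, start, delta):
        # Add (delta=+1) or remove (delta=-1) one site from the counts.
        for j in range(width):
            counts[j][seq[start + j]] += delta

    for s, p in zip(seqs, starts):
        update(s, p, +1)

    # Normaliser for a profile built from the other n - 1 sequences.
    denom = (n - 1) + pseudo * len(alphabet)

    for _ in range(sweeps):
        for z, s in enumerate(seqs):
            # Predictive update step: hold out sequence z.
            update(s, starts[z], -1)

            # Sampling step: weight each candidate position by the ratio
            # of its probability under the motif model to the background.
            weights = []
            for p in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    a = s[p + j]
                    w *= (counts[j][a] + pseudo) / denom / background[a]
                weights.append(w)

            # Draw the new start in proportion to the weights, rather
            # than taking the maximum, so the sampler can leave local optima.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
            update(s, starts[z], +1)

    return starts
```

Because the column counts are updated incrementally rather than rebuilt for each held-out sequence, the cost of one sweep in this sketch grows linearly with the number of sequences. A call such as `gibbs_site_sampler(["ACGTACGTTTGACA", "TTGACAGGGCCC", "GGTTGACATTAA"], width=6)` returns a list of three start positions. The result is a stochastic sample, so different seeds can give different alignments.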
