The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is the choice of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters. This approach has previously been used only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform both the decision of whether to include a sequence in a multiple alignment and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD-score-based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.
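To make the idea of a log-odds score for a whole alignment column concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single Dirichlet prior over letter frequencies (the abstract speaks only of "prior distributions", which may be richer, for example mixtures), and it assumes that the background letter probabilities equal the mean of that prior. The function name `bild_column_score` and the parameter dictionary `alpha` are hypothetical names chosen for illustration. The score compares the probability of the column when its letters are drawn from one shared, unknown frequency vector (integrated over the prior) with the probability when the letters are drawn independently from the background.

```python
import math


def bild_column_score(column, alpha, base=2.0):
    """Log-odds score of one alignment column under a single Dirichlet prior.

    column: string of aligned letters, one per sequence (no gaps).
    alpha:  dict mapping each letter to its positive Dirichlet parameter
            (hypothetical input format, assumed for this sketch).
    base:   logarithm base; 2.0 gives a score in bits.
    """
    total_alpha = sum(alpha.values())
    n = len(column)

    # Count how often each letter occurs in the column.
    counts = {}
    for letter in column:
        counts[letter] = counts.get(letter, 0) + 1

    # Log probability of the ordered column with the unknown letter
    # frequencies integrated out against the Dirichlet prior.
    log_related = math.lgamma(total_alpha) - math.lgamma(total_alpha + n)
    for letter, c in counts.items():
        log_related += math.lgamma(alpha[letter] + c) - math.lgamma(alpha[letter])

    # Log probability of the same letters drawn independently from the
    # background, assumed here to be the prior mean alpha_j / total_alpha.
    log_unrelated = sum(
        c * math.log(alpha[letter] / total_alpha) for letter, c in counts.items()
    )

    return (log_related - log_unrelated) / math.log(base)


if __name__ == "__main__":
    uniform_dna = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0}
    for col in ("A", "AA", "AC", "AAAA", "ACGT"):
        print(col, round(bild_column_score(col, uniform_dna), 3))
```

Under these assumptions a column containing a single letter scores zero, a conserved column such as "AA" scores positively (log2 of 1.6, about 0.68 bits, with the uniform parameters above), and a mismatched column such as "AC" scores negatively. Working with log-gamma differences keeps the calculation numerically stable for columns drawn from many sequences.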
