Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model

Background: Certain protein families are highly conserved across distantly related organisms and belong to large and functionally diverse superfamilies. The patterns of conservation present in these protein sequences presumably are due to selective constraints maintaining important but unknown structural mechanisms with some constraints specific to each family and others shared by a larger subset or by the entire superfamily. To exploit these patterns as a source of functional information, we recently devised a statistically based approach called contrast hierarchical alignment and interaction network (CHAIN) analysis, which infers the strengths of various categories of selective constraints from co-conserved patterns in a multiple alignment. The power of this approach strongly depends on the quality of the multiple alignments, which thus motivated development of theoretical concepts and strategies to improve alignment of conserved motifs within large sets of distantly related sequences.

Results: Here we describe a hidden Markov model (HMM), an algebraic system, and Markov chain Monte Carlo (MCMC) sampling strategies for alignment of multiple sequence motifs. The MCMC sampling strategies are useful both for alignment optimization and for adjusting position-specific background amino acid frequencies for alignment uncertainties. Associated statistical formulations provide an objective measure of alignment quality as well as automatic gap penalty optimization. Improved alignments obtained in this way are compared with PSI-BLAST based alignments within the context of CHAIN analysis of three protein families: Giα subunits, prolyl oligopeptidases, and transitional endoplasmic reticulum (p97) AAA+ ATPases.

Conclusion: While not entirely replacing PSI-BLAST based alignments, which likewise may be optimized for CHAIN analysis using this approach, these motif-based methods often more accurately align very distantly related sequences and thus can provide a better measure of selective constraints. In some instances, these new approaches also provide a better understanding of family-specific constraints, as we illustrate for p97 ATPases. Programs implementing these procedures and supplementary information are available from the authors.
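
To make the idea of MCMC sampling over motif alignments concrete, the following is a minimal sketch of a plain Gibbs site sampler for a single ungapped motif. It is not the gapped HMM procedure described in this paper: it assumes exactly one motif occurrence per sequence, no gaps, a uniform background distribution, and a simple log-odds score. The function names, the pseudocount value, and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def column_counts(seqs, starts, width, skip=None, pseudo=0.5):
    """Residue counts (plus pseudocounts) for each motif column.

    The sequence at index `skip`, if given, is left out of the counts.
    """
    counts = [{a: pseudo for a in AMINO} for _ in range(width)]
    for i, (seq, s) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[s + j]] += 1.0
    return counts


def log_odds_score(seqs, starts, width, background=1.0 / len(AMINO)):
    """Sum over aligned residues of log(column frequency / background).

    Assumption: a uniform background, which a real method would not use.
    """
    counts = column_counts(seqs, starts, width)
    total = 0.0
    for seq, s in zip(seqs, starts):
        for j in range(width):
            col = counts[j]
            total += math.log(col[seq[s + j]] / sum(col.values()) / background)
    return total


def gibbs_motif_sampler(seqs, width, sweeps=200, seed=0):
    """Resample one motif start per sequence, conditioned on all the others."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best, best_score = list(starts), log_odds_score(seqs, starts, width)
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Motif model built from every sequence except the one being moved.
            counts = column_counts(seqs, starts, width, skip=i)
            weights = []
            for s in range(len(seq) - width + 1):
                logw = 0.0
                for j in range(width):
                    col = counts[j]
                    logw += math.log(col[seq[s + j]] / sum(col.values()))
                # The uniform background is the same for every start, so it cancels.
                weights.append(math.exp(logw))
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        score = log_odds_score(seqs, starts, width)
        if score > best_score:
            best, best_score = list(starts), score
    return best, best_score


# Toy data: the hypothetical motif "HPETGK" planted at a known offset in each sequence.
TOY_SEQS = [
    "ACDHPETGKLMNQ",
    "WHPETGKRSTVYA",
    "MNQRSHPETGKVW",
    "YVTSRQHPETGKA",
    "FIHPETGKLMCDW",
]
PLANTED = [3, 1, 5, 6, 2]
WIDTH = 6

best_starts, best_score = gibbs_motif_sampler(TOY_SEQS, WIDTH)
```

On this toy input the planted offsets give perfectly conserved columns, so `log_odds_score(TOY_SEQS, PLANTED, WIDTH)` is an upper bound on the score of any alignment the sampler can return. Comparing `best_score` against that bound is a quick check of how close a given run came; handling gaps, multiple motifs, and non-uniform backgrounds requires the fuller machinery the abstract describes.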


[17]  Detlef D. Leipe,et al.  Evolutionary history and higher order classification of AAA+ ATPases. , 2004, Journal of structural biology.

[18]  A. F. Neuwald,et al.  HEAT repeats associated with condensins, cohesins, and other complexes involved in chromosome-related functions. , 2000, Genome research.

[19]  M. Nardini,et al.  α/β Hydrolase fold enzymes : the family keeps growing , 1999 .

[20]  S F Altschul,et al.  Generalized affine gap costs for protein sequence alignment , 1998, Proteins.

[21]  Jun S. Liu,et al.  Gibbs motif sampling: Detection of bacterial outer membrane protein repeats , 1995, Protein science : a publication of the Protein Society.

[22]  Jun S. Liu,et al.  Markovian structures in biological sequence alignments , 1999 .

[23]  M. Nardini,et al.  Alpha/beta hydrolase fold enzymes: the family keeps growing. , 1999, Current opinion in structural biology.

[24]  Andrew F. Neuwald,et al.  Evolutionary clues to DNA polymerase III β clamp structural mechanisms , 2003 .

[25]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[26]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[27]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[28]  M van Heel,et al.  Structure of the AAA ATPase p97. , 2000, Molecular cell.

[29]  A. Wilkinson,et al.  AAA+ superfamily ATPases: common structure–diverse function , 2001, Genes to cells : devoted to molecular & cellular mechanisms.

[30]  Andrew F Neuwald,et al.  Evolutionary constraints associated with functional specificity of the CMGC protein kinases MAPK, CDK, GSK, SRPK, DYRK, and CK2α , 2004, Protein science : a publication of the Protein Society.

[31]  S J Remington,et al.  The alpha/beta hydrolase fold. , 1992, Protein engineering.

[32]  A. F. Neuwald,et al.  PSI-BLAST searches using hidden markov models of structural repeats: prediction of an unusual sliding DNA clamp and of beta-propellers in UV-damaged DNA-binding protein. , 2000, Nucleic acids research.

[33]  D. Lipman,et al.  Extracting protein alignment models from the sequence database. , 1997, Nucleic acids research.

[34]  E V Koonin,et al.  AAA+: A class of chaperone-like ATPases associated with the assembly, operation, and disassembly of protein complexes. , 1999, Genome research.

[35]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[36]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.