A homology identification method that combines protein sequence and structure information

A new method is presented for identifying distantly related homologous proteins that conventional sequence comparison methods cannot recognize. The method combines information about functionally conserved sequence patterns with information about structural context. This information is encoded in stochastic discrete state‐space models (DSMs) that comprise a new family of hidden Markov models. The new models are called sequence‐pattern‐embedded DSMs (pDSMs). The method identifies distantly related protein family members with high sensitivity and specificity. It is illustrated with trypsin‐like serine proteases and globins, and the strategy for building pDSMs is presented. The method has been validated using carefully constructed positive and negative control sets. In addition to recognizing remote homologs, pDSM sequence analysis predicts secondary structures with higher sensitivity, specificity, and Q3 accuracy than DSM analysis, which omits information about conserved sequence patterns. The identification of trypsin‐like serine proteases in new genomes is discussed.
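The secondary-structure measures named above can be illustrated with a minimal sketch. It assumes three-state labels (H for helix, E for strand, C for coil), takes Q3 as the fraction of residues whose predicted state matches the observed state, and uses the common per-state definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). The abstract does not give the exact definitions used in the paper, so these formulas, the function names, and the example strings are all assumptions made for illustration.

```python
from typing import Tuple


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E, C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def sensitivity_specificity(predicted: str, observed: str, state: str) -> Tuple[float, float]:
    """Per-state sensitivity TP/(TP+FN) and specificity TN/(TN+FP).

    Assumed definitions; a ratio with a zero denominator is returned as NaN.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if o == state and p == state:
            tp += 1
        elif o == state:
            fn += 1
        elif p == state:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical ten-residue example; the strings are made up, not data from the paper.
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEEE"
q3 = q3_accuracy(predicted, observed)
helix_sens, helix_spec = sensitivity_specificity(predicted, observed, "H")
print(f"Q3 = {q3:.2f}, helix sensitivity = {helix_sens:.2f}, helix specificity = {helix_spec:.2f}")
```

Comparing two models, such as a DSM and a pDSM, on these measures would amount to calling the same functions on each model's predicted string against one observed assignment.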
