Pfam: A comprehensive database of protein domain families based on seed alignments

Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam‐A is curated and contains well‐characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam‐B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam‐A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new Kazal, fibronectin type III, and response regulator receiver domains. Pfam‐A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences. Proteins: 28:405–420, 1997. © 1997 Wiley‐Liss, Inc.
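The seed-alignment idea behind Pfam‐A can be illustrated with a deliberately simplified sketch. The code below is not Pfam's method and not a hidden Markov model: it builds an ungapped position-specific log-odds profile from a small seed alignment and slides it along a query sequence. The seed sequences, the uniform background frequency, the pseudocount, and the function names `build_profile` and `best_hit` are all assumptions made for illustration; a real profile HMM additionally models insertions and deletions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(seed_alignment, pseudocount=1.0):
    """Return per-column log-odds scores from an ungapped seed alignment."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("seed sequences must all have the same length")
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        # Log-odds of each residue at this column versus the background.
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start offset).

    Only the 20 standard residues are supported; returns (-inf, -1) when the
    sequence is shorter than the profile.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[aa] for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


seed = ["CPRHC", "CPKHC", "CPRYC"]  # hypothetical seed alignment
profile = build_profile(seed)
score, start = best_hit(profile, "MKTACPRHCLLV")
```

In this toy setting the small, hand-checked seed determines the model, and the model is then used to locate matching regions in other sequences, which mirrors the division of labor the abstract describes between manual curation and automatic search.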
