Pfam: the protein families database

Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that Pfam provides over and above the basic family data. For each feature, we assessed its relevance, computational burden, usage statistics and functionality in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets, as well as a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
