Pfam 10 years on: 10 000 families and still growing

Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently, Pfam matches 72% of known protein sequences; for proteins of known structure it matches 95%, which we believe represents the likely upper bound on coverage. Based on our analysis, a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that, as more sequences are added to the sequence databases, the fraction of sequences that Pfam matches falls, suggesting that continued addition of new families is essential to maintain its relevance.
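The coverage figures above are fractions of sequences that have at least one family match. The following is a minimal sketch of that measure on toy data, meant only to illustrate why coverage can fall as a sequence database grows while the family collection stays fixed. The function name `sequence_coverage`, the sequence IDs, and the family labels are all hypothetical and are not taken from Pfam or from the analysis described here.

```python
def sequence_coverage(matches):
    """Return the fraction of sequences that have at least one family match.

    `matches` maps a sequence ID to the set of family IDs that hit it.
    An empty set means no family matches that sequence.
    """
    if not matches:
        return 0.0
    covered = sum(1 for families in matches.values() if families)
    return covered / len(matches)


# Toy database (hypothetical IDs): three of four sequences are matched.
database = {
    "seq1": {"famA"},
    "seq2": {"famB", "famC"},
    "seq3": set(),
    "seq4": {"famA"},
}
coverage_before = sequence_coverage(database)

# New sequences are deposited, but no new families are built.
# Only one of the four newcomers matches an existing family.
database.update({
    "seq5": set(),
    "seq6": set(),
    "seq7": {"famB"},
    "seq8": set(),
})
coverage_after = sequence_coverage(database)

print(f"coverage before growth: {coverage_before:.2f}")
print(f"coverage after growth:  {coverage_after:.2f}")
```

In this toy case coverage drops from 3 of 4 sequences to 4 of 8, because the new sequences mostly fall outside the existing families. The real figures (72% and 95%) come from the authors' analysis and cannot be reproduced from this sketch.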
