MetaFam: a unified classification of protein families. I. Overview and statistics

MOTIVATION Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10 publicly-accessible protein family databases (Blocks + DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam's family 'supersets', as we call them, are created automatically using set-theory to compare families among the databases. Families of one database are matched to those in another when the intersection of their members exceeds all other possible family pairings between the two databases. Pairwise family matches are drawn together transitively to create a new list of protein family supersets. RESULTS MetaFam family supersets have several useful features: (1) each superset contains more members than the families from which it is composed, because each of the component family databases only works with a subset of our full non-redundant set of proteins; (2) conflicting assignments can be pinpointed quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions that are absent from automated databases can frequently be assigned; (4) statistics have been computed comparing domain boundaries, family size distributions, and overall quality of MetaFam supersets; (5) the supersets have been loaded into a relational database to allow for complex queries and visualization of the connections among families in a superset and the consensus of individual domain members; and (6) the quality of individual supersets has been assessed using numerous quantitative measures such as family consistency, connectedness, and size. We anticipate this new resource will be particularly useful to genomic database curators.

[1]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 1999, Nucleic Acids Res..

[2]  Jérôme Gouzy,et al.  ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons , 2000, Nucleic Acids Res..

[3]  Shmuel Pietrokovski,et al.  Increased coverage of protein families with the Blocks Database servers , 2000, Nucleic Acids Res..

[4]  Sándor Pongor,et al.  The SBASE protein domain library, Release 4.0: a collection of annotated protein sequence segments , 1993, Nucleic Acids Res..

[5]  Cathy H. Wu,et al.  ProClass protein family database , 2000, Nucleic Acids Res..

[6]  T L Blundell,et al.  A database of globular protein structural domains: clustering of representative family members into similar folds. , 1996, Folding & design.

[7]  Jérôme Gracy,et al.  Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment , 1998, Bioinform..

[8]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[9]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[10]  Peer Bork,et al.  SMART: a web-based tool for the study of genetically mobile domains , 2000, Nucleic Acids Res..

[11]  Amos Bairoch,et al.  The PROSITE database, its status in 1999 , 1999, Nucleic Acids Res..

[12]  J. Hein,et al.  Combining many multiple alignments in one improved alignment , 1999, Bioinform..

[13]  Chris Sander,et al.  Protein folds and families: sequence and structure alignments , 1999, Nucleic Acids Res..

[14]  M. O. Dayhoff,et al.  22 A Model of Evolutionary Change in Proteins , 1978 .

[15]  Jérôme Gracy,et al.  Automated protein sequence database classification. II. Delineation Of domain boundaries from sequence similarities , 1998, Bioinform..

[16]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 , 2000, Nucleic Acids Res..

[17]  Amos Bairoch,et al.  The PROSITE database, its status in 1997 , 1997, Nucleic Acids Res..

[18]  Terri K. Attwood,et al.  PRINTS-S: the database formerly known as PRINTS , 2000, Nucleic Acids Res..

[19]  Sándor Pongor,et al.  The SBASE protein domain library, release 7.0: a collection of annotated protein sequence segments , 2000, Nucleic Acids Res..

[20]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[21]  Nathan Linial,et al.  ProtoMap: automatic classification of protein sequences and hierarchy of protein families , 2000, Nucleic Acids Res..

[22]  S. Henikoff,et al.  Embedding strategies for effective use of information from multiple sequence alignments , 1997, Protein science : a publication of the Protein Society.

[23]  Martin Vingron,et al.  The SYSTERS protein sequence cluster set , 2000, Nucleic Acids Res..

[24]  Winona C. Barker,et al.  PIR-ALN: a database of protein sequence alignments , 1999, Bioinform..

[25]  Arne Elofsson,et al.  A comparison of sequence and structure protein domain families as a basis for structural genomics , 1999, Bioinform..

[26]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[27]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[29]  T. K. Attwood,et al.  ADSP - a new package for computational sequence analysis , 1992, Comput. Appl. Biosci..

[30]  Alex Bateman,et al.  InterPro: An Integrated Documentation Resource for Protein Families, Domains and Functional Sites , 2002, Briefings Bioinform..

[31]  Friedhelm Pfeiffer,et al.  Database of protein sequence alignments: PIR-ALN , 1999, Nucleic Acids Res..

[32]  HistoricalAccession,et al.  MetaFam: a unified classification of protein families. II. Schema and query capabilities , 2001 .

[33]  Z. X. Wang,et al.  A re-estimation for the total numbers of protein folds and superfamilies. , 1998, Protein engineering.

[34]  Shmuel Pietrokovski,et al.  Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations , 1999, Bioinform..

[35]  Peter B. McGarvey,et al.  The Protein Information Resource (PIR) , 2000, Nucleic Acids Res..

[36]  Kay Hofmann,et al.  Protein classification and functional assignment , 1998 .

[37]  James E. Bray,et al.  The CATH Database provides insights into protein structure/function relationships , 1999, Nucleic Acids Res..