InterPro: An Integrated Documentation Resource for Protein Families, Domains and Functional Sites

The exponential increase in submissions of nucleotide sequences to the nucleotide sequence databases by genome sequencing centres has created a need for rapid, automatic methods for classifying the resulting protein sequences. Several signature-based and sequence cluster-based methods for protein classification exist, and each resource has distinct areas of optimum application owing to differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases (PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs) form the InterPro core. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions and literature references, with links back to the relevant member database(s). Release 4.0 of InterPro (November 2001) contains 4,691 entries, representing 3,532 families, 1,068 domains, 74 repeats and 15 sites of post-translational modification (PTMs), encoded by different regular expressions, profiles, fingerprints and hidden Markov models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124 SWISS-PROT and TrEMBL protein sequences). The database is freely accessible for text- and sequence-based searches.
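The unification of member-database signatures into single entries can be pictured with a minimal Python sketch. It is an illustration only, not InterPro's actual schema or software. The class `InterProEntry`, the function `unify`, the field names, and every accession and signature identifier below are hypothetical placeholders invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InterProEntry:
    """Hypothetical record for one integrated entry (not the real schema)."""
    accession: str                      # unique entry accession
    entry_type: str                     # e.g. family, domain, repeat, PTM site
    description: str                    # functional description
    references: list = field(default_factory=list)   # literature references
    # member database name -> signature identifiers linked back to it
    signatures: dict = field(default_factory=lambda: defaultdict(list))

    def add_signature(self, member_db, signature_id):
        self.signatures[member_db].append(signature_id)

    def member_databases(self):
        return sorted(self.signatures)


def unify(entry_info, signature_rows):
    """Group (member_db, signature_id, accession) rows into entries."""
    entries = {}
    for member_db, signature_id, accession in signature_rows:
        if accession not in entries:
            entry_type, description = entry_info[accession]
            entries[accession] = InterProEntry(accession, entry_type, description)
        entries[accession].add_signature(member_db, signature_id)
    return entries


# Placeholder data: these identifiers are made up for illustration.
entry_info = {"IPR999999": ("domain", "Placeholder domain description")}
signature_rows = [
    ("Pfam", "PF99999", "IPR999999"),
    ("PROSITE", "PS99999", "IPR999999"),
    ("PRINTS", "PR99999", "IPR999999"),
]

entries = unify(entry_info, signature_rows)
for entry in entries.values():
    print(entry.accession, entry.entry_type, entry.member_databases())
```

Keying the signatures by member database keeps the link back to each source resource explicit, which mirrors the cross-referencing described above.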
