InterPro protein classification.

Improvements in nucleotide sequencing technology have resulted in an ever-increasing number of nucleotide and protein sequences being deposited in databases. Unfortunately, manual classification and annotation cannot keep pace with the rate at which these sequences are generated, so a growing proportion of sequences remains unannotated. Automatic annotation tools can help redress the balance. A number of groups produce protein signatures that describe protein families, functional domains, or conserved sites within related groups of proteins. Protein signature databases include CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs. Their approaches range from characterising small conserved motifs that can identify members of a family or subfamily, to using hidden Markov models that describe the conservation of residues over entire domains or whole proteins. To increase their value as protein classification tools, the protein signatures from these 11 databases have been combined into one powerful annotation tool: the InterPro database (http://www.ebi.ac.uk/interpro/) (Hunter et al., Nucleic Acids Res 37:D211-D215, 2009). InterPro is an open-source protein resource used for the automatic annotation of proteins. It scales to the analysis of entire new genomes through a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro also provides structural information from PDB (Kouranov et al., Nucleic Acids Res 34:D302-D305, 2006), the corresponding classifications in CATH (Cuff et al., Nucleic Acids Res 37:D310-D314, 2009) and SCOP (Andreeva et al., Nucleic Acids Res 36:D419-D425, 2008), and homology models from ModBase (Pieper et al., Nucleic Acids Res 37:D347-D354, 2009) and SwissModel (Kiefer et al., Nucleic Acids Res 37:D387-D392, 2009), allowing a direct comparison of the protein signatures with the available structural information.
This chapter reviews the signature methods found in the InterPro database, and provides an overview of the InterPro resource itself.
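
As a rough illustration of the simplest kind of signature mentioned above, a small conserved motif, the following Python sketch translates a PROSITE-style pattern into a regular expression and scans a sequence with it. It is not how InterProScan or any member database is implemented; the function names `prosite_to_regex` and `find_matches` and the toy sequence are assumptions made for this example, and only a subset of the pattern syntax is handled (`x`, `[...]`, `{...}`, repeat counts, and the `<`/`>` anchors at the pattern ends).

```python
import re


def _element_to_regex(element: str) -> str:
    """Convert one dash-separated pattern element, e.g. 'x(2,4)' or '{P}'."""
    # Split the element into its core and an optional repeat count "(n)" or "(n,m)".
    core, low, high = re.fullmatch(
        r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
    ).groups()
    if core == "x":
        core = "."                        # any residue
    elif core.startswith("{"):
        core = "[^" + core[1:-1] + "]"    # any residue except those listed
    # '[ST]' and single residues such as 'N' are already valid regex.
    if low is not None:
        core += "{" + low + ("," + high if high else "") + "}"
    return core


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):           # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):             # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    body = "".join(_element_to_regex(e) for e in pattern.split("-"))
    return prefix + body + suffix


def find_matches(pattern: str, sequence: str):
    """Return (1-based start, matched residues) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead keeps each match zero-width so overlapping hits are reported.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence)]


if __name__ == "__main__":
    # N-glycosylation style pattern: N, then not P, then S or T, then not P.
    print(find_matches("N-{P}-[ST]-{P}", "MNGTAANPSK"))
```

In the toy sequence, the asparagine at position 2 satisfies the pattern, while the one at position 7 is rejected because it is followed by proline. Profile and hidden Markov model methods replace this all-or-nothing match with a position-specific score, which is what lets them describe whole domains rather than short motifs.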