SIMAP—structuring the network of protein similarities

Protein sequences are the most important source of evolutionary and functional information for new proteins. In order to facilitate the computationally intensive tasks of sequence analysis, the Similarity Matrix of Proteins (SIMAP) database aims to provide a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. As of September 2007, SIMAP covers ∼17 million proteins and more than 6 million non-redundant sequences and provides a complete annotation based on InterPro 16. Novel features of SIMAP include a new, portlet-based web portal providing multiple, structured views on retrieved proteins and integration of protein clusters and a unique search method for similar domain architectures. Access to SIMAP is freely provided for academic use through the web portal for individuals at http://mips.gsf.de/simap/and through Web Services for programmatic access at http://mips.gsf.de/webservices/services/SimapService2.0?wsdl.

[1]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[2]  Thomas Rattei,et al.  SIMAP - The similarity matrix of proteins , 2005, ECCB/JBI.

[3]  Robert D. Finn,et al.  New developments in the InterPro database , 2007, Nucleic Acids Res..

[4]  S. Brunak,et al.  Improved prediction of signal peptides: SignalP 3.0. , 2004, Journal of molecular biology.

[5]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[6]  A. Krogh,et al.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. , 2001, Journal of molecular biology.

[7]  David P. Anderson,et al.  Using Public Resource Computing and Systematic Pre-calculation for Large Scale Sequence Analysis , 2006, GCCB.

[8]  Lei Zhu,et al.  An initial strategy for comparing proteins at the domain architecture level , 2006, Bioinform..

[9]  Cathy H. Wu,et al.  The Universal Protein Resource (UniProt): an expanding universe of protein information , 2005, Nucleic Acids Res..

[10]  Joaquín Dopazo,et al.  PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes , 2005, Nucleic Acids Res..

[11]  Rolf Apweiler,et al.  Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters , 2003, Nucleic Acids Res..

[12]  Dmitrij Frishman,et al.  PEDANT genome database: 10 years online , 2006, Nucleic Acids Res..

[13]  Qing Zhang,et al.  The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema , 2004, Nucleic Acids Res..

[14]  Dmitrij Frishman,et al.  PROMPT: a protein mapping and comparison tool , 2006, BMC Bioinformatics.

[15]  David L. Wheeler,et al.  GenBank , 2015, Nucleic Acids Res..

[16]  S. Kurtz The Vmatch large scale sequence analysis software , 2003 .

[17]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[19]  Ori Sasson,et al.  ProtoNet 4.0: A hierarchical classification of one million protein sequences , 2004, Nucleic Acids Res..

[20]  Tatiana A. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2004, Nucleic Acids Res..

[21]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[22]  Igor V. Tetko,et al.  Super paramagnetic clustering of protein sequences , 2005, BMC Bioinformatics.