SIMAP—the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage

The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow SIMAP to be connected with other bioinformatics tools and resources. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to keep pace with the rapidly growing universe of sequenced proteins. New Web-Service interfaces enhance the connectivity of SIMAP, and novel tools for the interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal, which also contains instructions and links for software access and flat file downloads.
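To make the notion of an all-against-all similarity matrix concrete, the following minimal sketch scores every pair in a small protein collection with a basic Smith–Waterman local alignment. It is an illustration only and not SIMAP's implementation: the function names, the toy sequences and the simple match/mismatch/linear-gap scoring are assumptions chosen for brevity, whereas a production pipeline would typically use a substitution matrix, affine gap penalties and a heuristic pre-filter such as FASTA.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Scoring is a simplifying assumption: fixed match/mismatch values
    and a linear gap penalty instead of a substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all(seqs):
    """Score every unordered pair of entries in a {name: sequence} dict."""
    return {
        (x, y): smith_waterman_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy collection, not real SIMAP data.
    toy = {"p1": "MKV", "p2": "MKV", "p3": "WWW"}
    for pair, score in all_against_all(toy).items():
        print(pair, score)
```

The number of pairs grows quadratically with the size of the collection, which is why pre-calculating and storing the scores, as SIMAP does, pays off for large databases.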
