PredictProtein - Predicting Protein Structure and Function for 29 Years

Abstract Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis. Its main site is hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and was queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions, and it pioneered the combination of evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges), and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA-binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering the performance of the prediction methods); user interface elements improved usability; and new prediction methods were added. PredictProtein recently added predictions from deep learning embeddings (GO and secondary structure) and a method for predicting proteins and residues that bind DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.
