ASAP: a machine learning framework for local protein properties

Determining residue-level protein properties, such as sites of post-translational modifications (PTMs), is vital to understanding protein function. Experimental methods are costly and time-consuming, while traditional rule-based computational methods fail to annotate sites that lack substantial similarity. Machine learning (ML) methods are becoming fundamental for annotating unknown proteins and their heterogeneous properties. We present ASAP (Amino-acid Sequence Annotation Prediction), a universal ML framework for predicting residue-level properties. ASAP extracts numerous features from raw sequences and supports easy integration of external features such as secondary structure, solvent accessibility, intrinsic disorder or PSSM profiles. These features are then used to train ML classifiers. ASAP can create new classifiers within minutes for a variety of tasks, including PTM prediction (e.g. cleavage sites by convertase, phosphoserine modification). We present a detailed case study for ASAP: CleavePred, an ASAP-based model that predicts protein precursor cleavage sites with state-of-the-art results. Protein cleavage is a PTM shared by a wide variety of proteins with minimal sequence similarity. Current rule-based methods suffer from high false-positive rates, making them suboptimal. The high performance of CleavePred makes it suitable for analyzing new proteomes at a genomic scale. The tool is attractive for protein design, mass spectrometry search engines and the discovery of new bioactive peptides from precursors. ASAP functions as a baseline approach for residue-level protein sequence prediction. CleavePred is freely accessible as a web-based application. Both ASAP and CleavePred are open source, with a flexible Python API. Database URL: the ASAP and CleavePred source code, web tool and tutorials are available at https://github.com/ddofer/asap and http://protonet.cs.huji.ac.il/cleavepred.
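To illustrate what extracting residue-level features from a raw sequence can look like, the following is a minimal sketch that one-hot encodes a fixed-size window around each residue. It is not ASAP's actual API or feature set: the names `window_features`, `one_hot`, `half_width` and the padding symbol are hypothetical and chosen only for this example.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD = "_"  # hypothetical padding symbol for positions beyond the sequence ends


def one_hot(residue: str) -> List[int]:
    """One-hot encode a residue; padding and unknown letters give all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def window_features(sequence: str, half_width: int = 2) -> List[List[int]]:
    """Return one feature vector per residue, built from its surrounding window."""
    seq = sequence.upper()
    padded = PAD * half_width + seq + PAD * half_width
    vectors = []
    for i in range(len(seq)):
        # Window of 2 * half_width + 1 residues centred on position i.
        window = padded[i : i + 2 * half_width + 1]
        vector: List[int] = []
        for residue in window:
            vector.extend(one_hot(residue))
        vectors.append(vector)
    return vectors


# Toy sequence, used only for illustration.
features = window_features("MKRSLA", half_width=2)
print(len(features), len(features[0]))
```

Each residue yields a vector of (2 × `half_width` + 1) × 20 values. Vectors of this kind, optionally concatenated with external per-residue features, could then be paired with per-residue labels and passed to any standard classifier.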
