Sequence- and structure-based prediction of amyloidogenic regions in proteins

Machine learning methods are increasingly used in proteomics research, especially for analyzing and predicting protein structures, functions, subcellular localizations and interactions. In recent years, much research has also focused on the protein misfolding problem and on the impact of unfolded and defective proteins on cell dysfunction, owing to its considerable importance for molecular medicine. The degradation and deposition of these abnormal proteins often result in the formation of plaque cores, among them the so-called amyloid fibrils, which are responsible for an increasing number of highly debilitating disorders in humans. Yet a significant challenge remains, especially in understanding the underlying causes and major risk factors of these harmful deposits in vital organs and tissues. This paper explores the potential of string kernel-based support vector machines for predicting amyloidogenic regions in proteins by incorporating the most informative features of the protein sequence, such as predicted secondary structure and solvent accessibility, with a special focus on α-helical conformations, which seem to be primarily involved in amyloidogenesis. Comparison with the most popular methods on the Pep424 and Reg33 benchmark datasets indicates the robustness of the predictive model. Furthermore, the results show accurate prediction of regions promoting fibrillogenesis for experimentally determined amyloid proteins, reveal that the five amino acids leucine, glycine, alanine, valine and serine are predominant in amyloid-prone regions, and confirm that the core regions of an amyloid aggregate are not necessarily fully buried.
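
To make the notion of a string kernel concrete, the following is a minimal sketch of a spectrum kernel, one common member of the string kernel family. It is not the model described in this paper: the function names, the choice of k = 3 and the example peptides are illustrative assumptions, and no secondary structure or solvent accessibility features are included.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Spectrum kernel: inner product of the k-mer count vectors of a and b.

    Illustrative only; k=3 is an assumed value, not one taken from the paper.
    """
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared k-mers contribute.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


if __name__ == "__main__":
    # Hypothetical peptides, chosen only to exercise the function.
    peptides = ["LGAVSLGAV", "GAVSLG", "KKEEKKEE"]
    gram = [[spectrum_kernel(p, q) for q in peptides] for p in peptides]
    for row in gram:
        print(row)
```

A Gram matrix built this way could in principle be passed to any SVM implementation that accepts precomputed kernels. Sequence-derived annotations could be handled similarly by encoding them as additional strings, although that encoding is an assumption here rather than something the abstract specifies.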
