Predicting protein sumoylation sites from sequence features

Protein sumoylation is a post-translational modification that plays an important role in a wide range of cellular processes. Small ubiquitin-related modifier (SUMO) can be covalently and reversibly conjugated to the sumoylation sites of target proteins, many of which are implicated in human genetic disorders. Accurate prediction of protein sumoylation sites may help biomedical researchers design their experiments and understand the molecular mechanism of protein sumoylation. In this study, a new machine learning approach was developed for predicting sumoylation sites from protein sequence information. Random forests (RFs) and support vector machines (SVMs) were trained with data collected from the literature. Domain-specific knowledge, in the form of relevant biological features, was used for input vector encoding. RF classifier performance was affected by the sequence context of sumoylation sites, and a window of 20 residues with the core motif ΨKXE in the middle appeared to provide enough context information for sumoylation site prediction. The RF classifiers also outperformed the SVM models for predicting protein sumoylation sites from sequence features. The results suggest that the machine learning approach predicts protein sumoylation sites more accurately than existing methods. The classifiers have been used to develop a new web server, seeSUMO (http://bioinfo.ggc.org/seesumo/), for sequence-based prediction of protein sumoylation sites.
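
The 20-residue context described above can be illustrated with a minimal Python sketch that scans a protein sequence for ΨKXE matches and cuts out a window with the four-residue motif in the middle (8 flanking residues on each side). This is not the seeSUMO implementation. The function name `motif_windows`, the residue set used for Ψ (A, I, L, M, F, V, P), the `-` padding at the sequence termini, and the example sequence are all assumptions made for illustration.

```python
import re

# Assumption: residues accepted at the hydrophobic position (Ψ) of ΨKXE.
HYDROPHOBIC = "AILMFVP"
WINDOW = 20                 # total window length stated in the abstract
MOTIF_LEN = 4               # Ψ, K, X, E
FLANK = (WINDOW - MOTIF_LEN) // 2   # 8 residues on each side of the motif

# Lookahead so that overlapping motif matches are all reported.
MOTIF = re.compile(r"(?=[%s]K[A-Z]E)" % HYDROPHOBIC)


def motif_windows(seq, pad="-"):
    """Return (lysine_position, window) pairs for every ΨKXE match.

    lysine_position is 1-based. Windows that run past either end of the
    sequence are filled with `pad` (a hypothetical convention).
    """
    seq = seq.upper()
    padded = pad * FLANK + seq + pad * FLANK
    results = []
    for match in MOTIF.finditer(seq):
        psi = match.start()                 # 0-based index of Ψ in seq
        # In `padded`, Ψ sits at psi + FLANK, so the window starts at psi.
        window = padded[psi:psi + WINDOW]
        results.append((psi + 2, window))   # K follows Ψ; convert to 1-based
    return results


if __name__ == "__main__":
    # Hypothetical sequence: one LKSE motif flanked by glycines.
    example = "G" * 10 + "LKSE" + "G" * 10
    for position, window in motif_windows(example):
        print(position, window)
```

Each window could then be converted to a numeric feature vector and passed to a classifier. The choice of features and classifier settings is left open here, because the abstract does not specify them.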
