A two-stage method for O-glycosylation site prediction

Abstract Correctly predicting O-glycosylation sites will greatly benefit the search for and design of new, specific, and efficient GalNAc-transferase inhibitors. In this article, O-glycosylation sites were studied using the correlation-based feature subset (CfsSubset) selection method combined with a wrapper method. Twenty-three important biochemical features were selected, based on a jackknife test, from an original data set containing 4779 features. Using the AdaBoost method with the twenty-three selected features, the prediction model achieves an accuracy of 88.1% in the jackknife test and 87.5% on an independent test set, improvements of 8.5% and 10.42%, respectively, over the results obtained with the original data set. We expect that this feature selection scheme can serve as a useful auxiliary technique for finding effective competitive inhibitors of GalNAc-transferase. An online predictor based on this research is available at http://chemdata.shu.edu.cn/gal_p/ .
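
The filter stage named in the abstract can be illustrated with a minimal sketch of correlation-based feature subset scoring. It uses the standard CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation within the subset. The abstract does not state which correlation measure or search strategy was used, so the Pearson correlation, the greedy forward search, the function names, and the toy data below are all assumptions made for illustration. The wrapper stage and the AdaBoost classifier are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, labels, subset):
    """CFS merit of a feature subset, using absolute Pearson correlations (assumption)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean relevance: correlation of each selected feature with the class label.
    r_cf = sum(abs(pearson(columns[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    # Mean redundancy: correlation between every pair of selected features.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_select(columns, labels):
    """Greedy forward search: add the feature that most improves the merit, stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        candidate = max(remaining, key=lambda j: cfs_merit(columns, labels, selected + [j]))
        score = cfs_merit(columns, labels, selected + [candidate])
        if score <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected, best

# Hypothetical toy data: 6 samples, 3 features stored column-wise.
labels = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
columns = [
    [0.10, 0.20, 0.15, 0.90, 0.80, 0.95],  # feature 0: strongly related to the label
    [0.30, 0.10, 0.20, 0.70, 0.90, 0.60],  # feature 1: related to the label, overlaps with feature 0
    [0.50, 0.10, 0.90, 0.40, 0.60, 0.20],  # feature 2: mostly noise
]

selected, merit = cfs_forward_select(columns, labels)
print("selected features:", selected, "merit:", round(merit, 4))
```

In a two-stage scheme of the kind the abstract describes, the subset returned by such a filter would then be refined by a wrapper, which scores candidate subsets by the cross-validated accuracy of the actual classifier rather than by correlations.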
