Prediction of the O-glycosylation Sites in Protein by Layered Neural Networks and Support Vector Machines

O-glycosylation is one of the main types of mammalian protein glycosylation. It is specific to serine and threonine residues, but no consensus sequence has yet been identified. In this paper, a layered neural network and a support vector machine (SVM) are used to predict O-glycosylation sites. Three encodings of the protein sequence within a fixed-size window are used as input: a sparse coding that distinguishes all 20 amino acid residues, a 5-letter coding, and a hydropathy coding. In the neural network, a single output unit predicts whether a particular serine or threonine site is glycosylated, while the SVM classifies each site into one of two classes. Performance is evaluated by the Matthews correlation coefficient. Preliminary results with the neural network show that the sparse and 5-letter codings perform better than the hydropathy coding, while the results with the SVM show that the improvement obtained by changing the window size is limited to a certain extent.
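As an illustration of the sparse coding and the evaluation measure mentioned above, the following is a minimal sketch and not the implementation used in this work. The function names, the residue ordering, the symmetric window given by `half_width`, and the all-zero padding at the sequence ends are assumptions made for the example. The Matthews correlation coefficient is computed with its standard definition from the counts of true and false positives and negatives.

```python
import math

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def candidate_sites(sequence):
    """Return the indices of serine (S) and threonine (T) residues."""
    return [i for i, residue in enumerate(sequence) if residue in "ST"]


def sparse_encode_window(sequence, site, half_width):
    """Sparse (one-hot) encoding of a window centred on `site`.

    Each window position becomes a 20-element 0/1 vector. Positions that
    fall outside the sequence, and unknown residues, are encoded as all
    zeros (an assumption made for this sketch).
    """
    features = []
    for pos in range(site - half_width, site + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    seq = "MASTP"  # hypothetical toy sequence
    for site in candidate_sites(seq):
        vector = sparse_encode_window(seq, site, half_width=1)
        print(site, len(vector), sum(vector))
    print(matthews_cc(tp=5, tn=5, fp=0, fn=0))
```

With a window of 2 x `half_width` + 1 residues, the sparse coding yields 20 inputs per window position, so the input dimension grows linearly with the window size.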