A self-training semi-supervised support vector machine method for recognizing transcription start sites

The task of finding transcription start sites (TSSs) can be modeled as a classification problem. Semi-supervised support vector machines (S3VMs) are an appealing method for exploiting unlabeled data in classification. To incorporate prior biological knowledge into TSS recognition, we propose a self-training S3VM (ST-S3VM) algorithm. ST-S3VM builds an SVM classifier from a small amount of labeled data and a large amount of unlabeled data, and incorporates prior biological knowledge by combining an appropriately engineered kernel function with a self-training algorithm. The algorithm has been implemented and tested on previously published data. Our experimental results on real nucleotide sequence data show that the method greatly improves prediction accuracy and performs significantly better than ESTSCAN and SVMs with the Salzberg kernel.
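
The self-training idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name and setting in it is an assumption made for illustration. A k-mer count feature map stands in for the engineered kernel, and a linear SVM fitted by batch subgradient descent on the hinge loss stands in for the SVM solver. The names `kmer_counts`, `train_linear_svm`, `self_train`, and `predict`, the confidence `threshold`, and the toy sequences are all hypothetical. In each round, unlabeled sequences whose decision value lies on or beyond the margin are pseudo-labeled and added to the training set, and the classifier is retrained.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Map a nucleotide sequence to its k-mer counts (assumed stand-in feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def score(w, b, x):
    """Linear decision value w.x + b for a sparse feature dictionary x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items()) + b


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Fit a linear SVM by batch subgradient descent on the regularized hinge loss."""
    w, b, n = {}, 0.0, len(X)
    for _ in range(epochs):
        # Examples inside the margin contribute to the hinge-loss subgradient.
        violators = [(x, t) for x, t in zip(X, y) if t * score(w, b, x) < 1.0]
        # Shrink the weights: this is the gradient step for the L2 regularizer.
        w = {f: (1.0 - eta * lam) * v for f, v in w.items()}
        for x, t in violators:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + eta * t * v / n
            b += eta * t / n
    return w, b


def self_train(labeled, unlabeled, threshold=1.0, max_rounds=10):
    """Self-training loop: pseudo-label confident unlabeled points, then retrain."""
    X = [kmer_counts(s) for s, _ in labeled]
    y = [label for _, label in labeled]
    pool = [kmer_counts(s) for s in unlabeled]
    w, b = train_linear_svm(X, y)
    for _ in range(max_rounds):
        confident, rest = [], []
        for x in pool:
            s = score(w, b, x)
            (confident if abs(s) >= threshold else rest).append((x, s))
        if not confident:
            break  # nothing left that the current classifier is sure about
        for x, s in confident:
            X.append(x)
            y.append(1 if s > 0 else -1)
        pool = [x for x, _ in rest]
        w, b = train_linear_svm(X, y)
    return w, b


def predict(model, seq):
    """Return +1 (TSS-like) or -1 for a sequence under the trained model."""
    w, b = model
    return 1 if score(w, b, kmer_counts(seq)) > 0 else -1


# Toy data, invented for illustration only: +1 marks TATA-rich sequences.
labeled = [("TATATATA", 1), ("GCGCGCGC", -1)]
unlabeled = ["TATATAAA", "GCGCGGCG", "ATATATAT", "CGCGCGCG"]
model = self_train(labeled, unlabeled)
```

Calling `predict(model, seq)` on a new sequence returns its predicted class. With a kernel SVM in place of the linear stand-in, the same loop applies unchanged; only `train_linear_svm` and `score` would need to be replaced.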