Paper information - A self-training semi-supervised support vector machine method for recognizing transcription start sites

A self-training semi-supervised support vector machine method for recognizing transcription start sites

The task of finding transcription start sites (TSSs) can be modeled as a classification problem. Semi-Supervised Support Vector Machines (S3VMs) are an appealing method for using unlabeled data in classification. To incorporate prior biological knowledge into TSS recognition, we propose a Self-Training S3VM (ST-S3VM) algorithm. ST-S3VM builds an SVM classifier from a small amount of labeled data and a large amount of unlabeled data, and incorporates prior biological knowledge by engineering an appropriate kernel function within a self-training algorithm. The algorithm has been implemented and tested on previously published data. Our experimental results on real nucleotide sequence data show that our method greatly improves prediction accuracy and performs significantly better than ESTSCAN and SVMs with the Salzberg kernel.
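The self-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a kernel perceptron stands in for the SVM, a simple k-mer spectrum kernel stands in for the engineered biological kernel, and the function names, the confidence `threshold`, and the toy sequences are all assumptions made for illustration.

```python
def spectrum_kernel(a, b, k=3):
    """String kernel: sum over k-mers of (count in a) * (count in b)."""
    counts = {}
    for i in range(len(a) - k + 1):
        kmer = a[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return sum(counts.get(b[i:i + k], 0) for i in range(len(b) - k + 1))


def decision(x, seqs, ys, alpha, kernel):
    """Signed score of x; the sign is the predicted class (+1 or -1)."""
    return sum(a * y * kernel(s, x) for a, s, y in zip(alpha, seqs, ys))


def train_kernel_perceptron(seqs, ys, kernel, epochs=10):
    """Stand-in for SVM training (assumption): a plain kernel perceptron."""
    alpha = [0] * len(seqs)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(seqs, ys)):
            if y * decision(x, seqs, ys, alpha, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def self_train(labeled, labels, unlabeled, kernel, rounds=5, threshold=1.0):
    """Repeatedly train, pseudo-label confident unlabeled points, retrain."""
    seqs, ys = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(rounds):
        alpha = train_kernel_perceptron(seqs, ys, kernel)
        scored = [(decision(x, seqs, ys, alpha, kernel), x) for x in pool]
        confident = [(s, x) for s, x in scored if abs(s) >= threshold]
        if not confident:
            break
        for s, x in confident:
            # Move the confidently scored sequence into the training set
            # with the label predicted by the current classifier.
            seqs.append(x)
            ys.append(1 if s > 0 else -1)
        pool = [x for s, x in scored if abs(s) < threshold]
    alpha = train_kernel_perceptron(seqs, ys, kernel)
    return seqs, ys, alpha


# Toy data (hypothetical): one labeled example per class, two unlabeled.
labeled = ["TATAAATATA", "GCGCGGCGCC"]
labels = [1, -1]
unlabeled = ["ATATAAATAT", "CGCGGCGCGG"]

seqs, ys, alpha = self_train(labeled, labels, unlabeled, spectrum_kernel)
score = decision("TTATAAAT", seqs, ys, alpha, spectrum_kernel)
print("predicted class:", 1 if score > 0 else -1)
```

In this sketch the kernel is the only place where sequence knowledge enters, so replacing `spectrum_kernel` with a kernel that weights positions around the candidate site would change the classifier without touching the self-training loop.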
