Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
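The combination named in the abstract, a semi-Markov sequence model trained with an online large-margin update, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not CRAIG's implementation: the two-label state set, the `MAX_LEN` cap, the content/length/transition features, and the passive-aggressive step (one simple member of the online large-margin family) are all hypothetical stand-ins.

```python
from collections import defaultdict

LABELS = ("coding", "noncoding")  # toy label set, not CRAIG's state space
MAX_LEN = 6                        # hypothetical cap on segment length


def segment_features(seq, start, end, label, prev_label):
    """Toy features of the segment seq[start:end] (assumed, for illustration)."""
    feats = defaultdict(float)
    for base in seq[start:end]:
        feats[("content", label, base)] += 1.0
    feats[("length", label, end - start)] += 1.0
    feats[("transition", prev_label, label)] += 1.0
    return feats


def global_features(seq, segments):
    """Sum of segment features over a whole segmentation."""
    total, prev = defaultdict(float), None
    for start, end, label in segments:
        for key, value in segment_features(seq, start, end, label, prev).items():
            total[key] += value
        prev = label
    return total


def score(weights, feats):
    return sum(weights.get(key, 0.0) * value for key, value in feats.items())


def decode(seq, weights):
    """Semi-Markov Viterbi: highest-scoring segmentation of a non-empty seq."""
    # best[(end, label)] = (score, (start, prev_label)) of the best prefix parse
    best = {(0, None): (0.0, None)}
    for end in range(1, len(seq) + 1):
        for label in LABELS:
            for start in range(max(0, end - MAX_LEN), end):
                for prev in ([None] if start == 0 else LABELS):
                    if prev == label or (start, prev) not in best:
                        continue  # adjacent segments must change label
                    total = best[(start, prev)][0] + score(
                        weights, segment_features(seq, start, end, label, prev))
                    if (end, label) not in best or total > best[(end, label)][0]:
                        best[(end, label)] = (total, (start, prev))
    end, label = max(((len(seq), y) for y in LABELS), key=lambda s: best[s][0])
    segments = []
    while label is not None:  # follow back-pointers to the start of seq
        start, prev = best[(end, label)][1]
        segments.append((start, end, label))
        end, label = start, prev
    return segments[::-1]


def hamming(gold, pred):
    """Number of positions whose labels differ between two segmentations."""
    expand = lambda segs: [lab for s, e, lab in segs for _ in range(e - s)]
    return sum(g != p for g, p in zip(expand(gold), expand(pred)))


def update(weights, seq, gold):
    """One passive-aggressive step; returns the Hamming loss of the prediction."""
    pred = decode(seq, weights)
    loss = hamming(gold, pred)
    if loss == 0:
        return 0
    gold_f, pred_f = global_features(seq, gold), global_features(seq, pred)
    diff = {k: gold_f.get(k, 0.0) - pred_f.get(k, 0.0)
            for k in set(gold_f) | set(pred_f)}
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return loss  # features cannot tell the two parses apart
    # Smallest step that makes gold outscore pred by a margin equal to the loss.
    tau = max(0.0, loss + score(weights, pred_f) - score(weights, gold_f)) / norm_sq
    for key, value in diff.items():
        weights[key] = weights.get(key, 0.0) + tau * value
    return loss


# Toy training run on a single made-up annotated sequence.
seq = "GCGCGCATATAT"
gold = [(0, 6, "coding"), (6, 12, "noncoding")]
weights = {}
for _ in range(500):
    if update(weights, seq, gold) == 0:
        break
print(decode(seq, weights))
```

The point of the sketch is that the weights are adjusted against the decoder's own best whole-sequence parse, so the features are trained jointly rather than as separately estimated signal and content models.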
