Improving the Accuracy of Classifiers for the Prediction of Translation Initiation Sites in Genomic Sequences

Predicting the Translation Initiation Site (TIS) in a genomic sequence is an important problem in biological research. Although several methods have been proposed for it, their accuracy leaves considerable room for improvement. Because of noise in the data and of biological factors, TIS prediction remains an open and non-trivial problem. In this paper we follow a three-step approach to increase TIS prediction accuracy. In the first step, we apply a feature generation algorithm that we developed. In the second step, all candidate features, including new ones generated by our algorithm, are ranked according to their impact on prediction accuracy. In the third step, a classification model is built from a number of the top-ranked features. We experiment with various feature sets, feature selection methods and classification algorithms, compare against alternative methods, draw conclusions from the results, and propose models with improved prediction accuracy.
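
The three steps described above (generate features, rank them, classify with the top-ranked ones) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the method of the paper: the positional features around a candidate ATG, the in-frame stop codon flag, the use of information gain for ranking, the naive Bayes classifier, the function names, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")

def generate_features(seq, atg_pos, window=3):
    """Step 1 (illustrative): binary features for one candidate ATG.
    Offsets are relative to the A of the ATG (offset 0)."""
    feats = {}
    for offset in range(-window, window + 3):
        if 0 <= offset < 3:
            continue  # skip the ATG itself
        i = atg_pos + offset
        base = seq[i] if 0 <= i < len(seq) else "N"
        feats[f"pos{offset:+d}={base}"] = 1
    downstream = seq[atg_pos + 3:]
    codons = [downstream[j:j + 3] for j in range(0, len(downstream) - 2, 3)]
    feats["downstream_stop"] = int(any(c in STOP_CODONS for c in codons))
    return feats

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, name):
    """Reduction in label entropy after splitting on one binary feature."""
    split = {0: [], 1: []}
    for feats, y in zip(rows, labels):
        split[feats.get(name, 0)].append(y)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in split.values())
    return entropy(labels) - remainder

def rank_features(rows, labels):
    """Step 2 (illustrative): order feature names by information gain."""
    names = sorted({name for feats in rows for name in feats})
    return sorted(names, key=lambda n: information_gain(rows, labels, n),
                  reverse=True)

def train_naive_bayes(rows, labels, selected):
    """Step 3 (illustrative): Bernoulli naive Bayes with Laplace smoothing."""
    model = {}
    for cls in set(labels):
        members = [f for f, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        probs = {n: (sum(f.get(n, 0) for f in members) + 1) / (len(members) + 2)
                 for n in selected}
        model[cls] = (prior, probs)
    return model

def predict(model, feats):
    def log_score(cls):
        prior, probs = model[cls]
        return math.log(prior) + sum(
            math.log(p if feats.get(n, 0) else 1 - p) for n, p in probs.items())
    return max(model, key=log_score)

# Toy data, invented for this sketch: (sequence, index of the ATG, is_TIS).
examples = [
    ("GCCACCATGGCG", 6, 1), ("TTTAAAATGGAT", 6, 1), ("CGAATCATGGTC", 6, 1),
    ("GCCTCCATGCCG", 6, 0), ("TTTCAAATGTAT", 6, 0), ("CGAGTCATGATC", 6, 0),
]
rows = [generate_features(seq, pos) for seq, pos, _ in examples]
labels = [y for _, _, y in examples]
ranked = rank_features(rows, labels)
model = train_naive_bayes(rows, labels, ranked[:2])  # keep the top 2 features
print(ranked[:2], predict(model, generate_features("AAAACCATGGCC", 6)))
```

In this toy set the two features that separate the classes perfectly are ranked first, and the classifier is trained on those alone. A realistic experiment would use a curated TIS data set, a much richer feature pool, and held-out evaluation rather than six hand-written sequences.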
