Feature Selection for the Prediction of Translation Initiation Sites

Translation initiation sites (TISs) are important signals in cDNA sequences. In previous attempts to predict TISs in cDNA sequences, three major factors have affected prediction performance: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to selecting and integrating relevant features for TIS prediction. The most significant features selected include those derived from the position weight matrix and the propensity matrix, the number of C nucleotides in the sequence downstream of the ATG, the number of downstream stop codons, the number of upstream ATGs, and the counts of certain amino acids, such as A and D. Using the numerical data generated from these features, different classification methods, including decision tree, naïve Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, and the experiments showed promising results.
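As a rough illustration of the count-based features named above, the following minimal Python sketch computes three of them for one candidate ATG. It is not the paper's implementation: the function name `count_features`, the `window` size, the choice to count stop codons only in the reading frame of the candidate ATG, and the choice to count upstream ATGs in any frame are all assumptions made for the example.

```python
# Hypothetical sketch: count-based features around one candidate ATG.
# Window size and frame conventions are assumptions, not taken from the paper.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_features(seq, atg_pos, window=99):
    """Return simple count features for the candidate ATG at index atg_pos."""
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG")

    # Sequence flanking the candidate, truncated at the sequence ends.
    upstream = seq[max(0, atg_pos - window):atg_pos]
    downstream = seq[atg_pos + 3:atg_pos + 3 + window]

    # Upstream ATGs, counted in any frame (assumption).
    upstream_atgs = sum(
        1 for i in range(len(upstream) - 2) if upstream[i:i + 3] == "ATG"
    )

    # Downstream stop codons, counted in frame with the candidate ATG (assumption).
    downstream_stops = sum(
        1
        for i in range(0, len(downstream) - 2, 3)
        if downstream[i:i + 3] in STOP_CODONS
    )

    # Number of C nucleotides downstream of the candidate ATG.
    downstream_c = downstream.count("C")

    return {
        "upstream_atgs": upstream_atgs,
        "downstream_stops": downstream_stops,
        "downstream_c": downstream_c,
    }


if __name__ == "__main__":
    # Toy sequence with a candidate ATG at index 5.
    print(count_features("ATGCCATGGCCTAACCC", 5))
```

Feature vectors of this kind, one per candidate ATG, could then be passed to any of the classifiers mentioned in the abstract.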
