Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements

Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding from non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study, in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed conditionally on the compared species and on the sequence lengths.

Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among human, mouse and rat. The association study showed that the analyzed features differ statistically between the two classes. To study the influence of sequence length on feature performance, a prediction study was performed on different data sets composed of coding and non-coding alignments in equal number and of equal length, with ascending average length across the data sets. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained on comparative features in general outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value ≤ 0.05).

Conclusion: We observed distinct and consistent patterns for the individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely because such features capture deviations from strictly neutral evolution expected as a consequence of the characteristics of the genetic code.
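The two kinds of analysis named in the abstract, comparing a feature's distribution between classes and measuring how well that feature predicts the class, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes a single numeric feature per element, uses the two-sample Kolmogorov-Smirnov statistic as one possible distribution comparison, and uses a one-dimensional linear discriminant with equal priors and a shared variance, which reduces to a threshold at the midpoint of the two class means. All function names and the synthetic feature values are hypothetical.

```python
import random
from bisect import bisect_right
from statistics import fmean


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def fit_midpoint_rule(coding, noncoding):
    """One-feature linear discriminant under equal priors and shared
    variance: threshold at the midpoint of the class means."""
    mean_c, mean_n = fmean(coding), fmean(noncoding)
    threshold = (mean_c + mean_n) / 2
    coding_is_low = mean_c < mean_n
    return threshold, coding_is_low


def predict(x, threshold, coding_is_low):
    """Assign a feature value to the class on its side of the threshold."""
    below = x < threshold
    return "coding" if below == coding_is_low else "noncoding"


def accuracy(coding, noncoding, threshold, coding_is_low):
    """Fraction of elements assigned to their true class."""
    hits = sum(predict(x, threshold, coding_is_low) == "coding" for x in coding)
    hits += sum(predict(x, threshold, coding_is_low) == "noncoding" for x in noncoding)
    return hits / (len(coding) + len(noncoding))


if __name__ == "__main__":
    # Synthetic, made-up feature values; not data from the paper.
    rng = random.Random(0)
    coding_train = [rng.gauss(0.3, 0.1) for _ in range(200)]
    noncoding_train = [rng.gauss(0.6, 0.1) for _ in range(200)]
    coding_test = [rng.gauss(0.3, 0.1) for _ in range(200)]
    noncoding_test = [rng.gauss(0.6, 0.1) for _ in range(200)]

    print("KS statistic:", ks_statistic(coding_train, noncoding_train))
    thr, low = fit_midpoint_rule(coding_train, noncoding_train)
    print("held-out accuracy:", accuracy(coding_test, noncoding_test, thr, low))
```

Fitting the rule on one sample and scoring it on a separate sample keeps the accuracy estimate from being optimistic. Repeating the procedure on data sets of increasing average length would give the kind of length-conditional comparison the abstract describes.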
