The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study.

Comparative genomics is a simple, powerful way to increase the accuracy of gene prediction. In this study, we show the utility of a simple test for the identification of protein-coding exons using human/mouse sequence comparisons. The test takes advantage of the fact that in the vast majority of coding regions, synonymous substitutions (K(S)) occur much more frequently than nonsynonymous ones (K(A)) and uses the K(A)/K(S) ratio as the criterion. We show the following: (1) most of the human and mouse exons are sufficiently long and have a suitable degree of sequence divergence for the test to perform reliably; (2) the test is suited for the identification of long exons and single exon genes, which are difficult to predict by current methods; (3) the test has a false-negative rate, lower than most of current gene prediction methods and a false-positive rate lower than all current methods; (4) the test has been automated and can be used in combination with other existing gene-prediction methods.

[1]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.

[2]  N. Goldman,et al.  A codon-based model of nucleotide substitution for protein-coding DNA sequences. , 1994, Molecular biology and evolution.

[3]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[4]  M. Boguski,et al.  Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[5]  R. Durbin,et al.  Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. , 1999, Genome research.

[6]  Ziheng Yang,et al.  Statistical methods for detecting molecular adaptation , 2000, Trends in Ecology & Evolution.

[7]  Jill P. Mesirov,et al.  Human and mouse gene structure: comparative analysis and application to exon prediction , 2000, RECOMB '00.

[8]  W. J. Kent,et al.  Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. , 2000, Genome research.

[9]  I-Min A. Dubchak,et al.  Active conservation of noncoding sequences revealed by three-way species comparisons. , 2000, Genome research.

[10]  International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome , 2001, Nature.

[11]  Alan K. Mackworth,et al.  Evaluation of gene-finding programs on mammalian sequences. , 2001, Genome research.

[12]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[13]  Ziheng Yang,et al.  Phylogenetic Analysis by Maximum Likelihood (PAML) , 2002 .