Simple Methods of Finding Short Protein Coding Sequences

Eukaryotic genomes contain many conserved regions of unknown function. Accurately assessing the protein-coding potential of these regions is a key step in annotation. We develop three protein-coding measures that directly assess conserved regions in multiple sequence alignments of many species: one based on phase shifts induced by alignment gaps, another based on third-position mutation asymmetry in codons, and a third based on nucleotide composition asymmetry. The methods are easy to implement and require no training. Using a human-chimp-rat-mouse-chicken multiple alignment, these measures can classify coding regions as short as 30 nt with greater specificity than single-genome measures achieve using 120 nt. Results from human-mouse and human-chicken alignments can be further improved by considering additional species; only the chimp genome proved uninformative. The phase-shift method is especially accurate.

Contact: agh@pcbi.upenn.edu, hannoh@seas.upenn.edu
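The abstract does not spell out how the phase-shift measure is scored, so the following is only a minimal sketch of the underlying idea, not the authors' method. It assumes that in coding regions alignment gaps tend to come in multiples of three, so an aligned species stays in the same reading frame as the reference. The function name `in_frame_fraction` and the scoring rule (fraction of ungapped columns with zero net phase shift) are hypothetical choices made for illustration.

```python
def in_frame_fraction(ref: str, other: str) -> float:
    """Fraction of ungapped alignment columns where `other` is in the
    same reading frame as `ref` (illustrative score, not from the paper).

    Both inputs are rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")

    phase = 0      # net phase shift so far: (ref bases - other bases) mod 3
    in_frame = 0   # ungapped columns seen while phase == 0
    aligned = 0    # all ungapped columns

    for r, o in zip(ref, other):
        r_gap, o_gap = r == "-", o == "-"
        if r_gap and o_gap:
            continue                   # column is a gap in both rows
        if r_gap:
            phase = (phase - 1) % 3    # base present only in `other`
        elif o_gap:
            phase = (phase + 1) % 3    # base present only in `ref`
        else:
            aligned += 1
            if phase == 0:
                in_frame += 1

    return in_frame / aligned if aligned else 0.0


if __name__ == "__main__":
    ref = "ATGGCCAAATTTGGG"
    print(in_frame_fraction(ref, "ATG---AAATTTGGG"))  # 3-nt gap keeps the frame
    print(in_frame_fraction(ref, "ATG-CCAAATTTGGG"))  # 1-nt gap shifts the frame
```

Under this assumption, a region whose gaps are all multiples of three scores 1.0, while a single 1-nt gap near the start pulls the score toward zero. Extending the sketch to a multiple alignment would mean applying it to each non-reference species and combining the scores, which is again a design choice not taken from the source.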
