Comparative exon prediction based on heuristic coding region alignment

Identifying protein coding genes is one of the most challenging problems in computational molecular biology. With increasing numbers of sequenced eukaryotic genomes and syntenic maps across species, it is possible to apply genomic comparison to gene recognition. Here, we propose a program, EXONALIGN, which simultaneously aligns and predicts exons between homologous genomic sequences. The program applies CORAL (coding region alignment), a heuristic linear-time alignment tool, to determine whether the regions following conserved splice signal pairs are significant. This approach, which combines intrinsic splice site strength with the conservation of protein coding regions and exon-intron structures, reduces computation time and increases prediction accuracy. EXONALIGN was tested on the ROSETTA data set of 117 human-mouse homologous sequence pairs. At the exon level, the sensitivity and specificity of EXONALIGN are 89% and 88%, respectively, and both are 98% at the nucleotide level. The rates of missing exons and wrong exons are both as low as 2%.
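The abstract does not define the accuracy measures it reports. As an illustration only, the sketch below assumes the conventions commonly used in gene-prediction benchmarks: sensitivity is the fraction of annotated items that are predicted correctly, "specificity" is the fraction of predictions that are correct, a missing exon is an annotated exon overlapping no prediction, and a wrong exon is a predicted exon overlapping no annotation. The function names and the inclusive `(start, end)` exon representation are hypothetical and are not taken from EXONALIGN.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def _overlaps(a: Exon, b: Exon) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_metrics(actual: List[Exon], predicted: List[Exon]) -> Dict[str, float]:
    """Exon-level measures; an exon is correct only if both boundaries match."""
    actual_set, predicted_set = set(actual), set(predicted)
    correct = len(actual_set & predicted_set)
    # Annotated exons that no predicted exon overlaps.
    missing = sum(
        1 for a in actual_set if not any(_overlaps(a, p) for p in predicted_set)
    )
    # Predicted exons that no annotated exon overlaps.
    wrong = sum(
        1 for p in predicted_set if not any(_overlaps(p, a) for a in actual_set)
    )
    return {
        "sensitivity": _ratio(correct, len(actual_set)),
        "specificity": _ratio(correct, len(predicted_set)),
        "missing_exons": _ratio(missing, len(actual_set)),
        "wrong_exons": _ratio(wrong, len(predicted_set)),
    }


def _positions(exons: List[Exon]) -> Set[int]:
    """Expand a list of exons into the set of covered nucleotide positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_level_metrics(actual: List[Exon], predicted: List[Exon]) -> Dict[str, float]:
    """Nucleotide-level measures computed over coding positions."""
    actual_nt, predicted_nt = _positions(actual), _positions(predicted)
    true_positives = len(actual_nt & predicted_nt)
    return {
        "sensitivity": _ratio(true_positives, len(actual_nt)),
        "specificity": _ratio(true_positives, len(predicted_nt)),
    }


if __name__ == "__main__":
    # Toy coordinates, not data from the paper.
    annotated = [(100, 200), (300, 400), (500, 600)]
    predicted = [(100, 200), (300, 410), (700, 800)]
    print(exon_level_metrics(annotated, predicted))
    print(nucleotide_level_metrics(annotated, predicted))
```

In this toy case the second predicted exon has one wrong boundary, so it counts as neither correct nor wrong at the exon level, while most of its bases still count as true positives at the nucleotide level. This is why nucleotide-level figures are typically higher than exon-level ones.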
