Comparative ab initio prediction of gene structures using pair HMMs

We present a novel comparative method for the ab initio prediction of protein coding genes in eukaryotic genomes. The method simultaneously predicts the gene structures of two un-annotated input DNA sequences which are homologous to each other and retrieves the subsequences which are conserved between the two DNA sequences. It is capable of predicting partial, complete and multiple genes and can align pairs of genes which differ by events of exon-fusion or exon-splitting. The method employs a probabilistic pair hidden Markov model. We generate annotations using our model with two different algorithms: the Viterbi algorithm in its linear memory implementation and a new heuristic algorithm, called the stepping stone, for which both memory and time requirements scale linearly with the sequence length. We have implemented the model in a computer program called DOUBLESCAN. In this article, we introduce the method and confirm the validity of the approach on a test set of 80 pairs of orthologous DNA sequences from mouse and human. More information can be found at: http://www.sanger.ac.uk/Software/analysis/doublescan/

[1]  Van Nostrand,et al.  Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[2]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[3]  G. Bernardi,et al.  The isochore organization of the human genome. , 1989, Annual review of genetics.

[4]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[5]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[6]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[7]  Anders Krogh,et al.  Two Methods for Improving Performance of a HMM and their Application for Gene Finding , 1997, ISMB.

[8]  R. Gibbs,et al.  Large-scale comparative sequence analysis of the human and murine Bruton's tyrosine kinase loci reveals conserved regulatory domains. , 1997, Genome research.

[9]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[10]  B. Berger,et al.  Domino tiling, gene recognition, and mice , 1999 .

[11]  R. Durbin,et al.  Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. , 1999, Genome research.

[12]  V. Solovyev,et al.  Analysis of canonical and non-canonical splice sites in mammalian genomes. , 2000, Nucleic acids research.

[13]  B. Berger,et al.  Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction , 2000 .

[14]  W. J. Kent,et al.  Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. , 2000, Genome research.

[15]  Simon Cawley,et al.  Applications of generalized pair hidden Markov models to alignment and gene finding problems , 2001, J. Comput. Biol..

[16]  Ian Korf,et al.  Integrating genomic homology into gene structure prediction , 2001, ISMB.

[17]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[18]  Alan K. Mackworth,et al.  Improving gene recognition accuracy by combining predictions from two gene-finding programs , 2002, Bioinform..

[19]  L. Duret,et al.  Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores , 1995, Journal of Molecular Evolution.