Multiple-sequence functional annotation and the generalized hidden Markov phylogeny

MOTIVATION: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints.

RESULTS: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance, which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation.

AVAILABILITY: A Web server is available at http://bonaire.lbl.gov/shadower
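
To make the phylogenetic shadowing idea concrete, the following is a minimal sketch of how a single alignment column might be scored for conservation. It is not shadower's implementation. It assumes a Jukes-Cantor substitution model, a hypothetical three-species tree with made-up branch lengths (`TREE`), and two illustrative rate multipliers (`slow` for constrained sites, `fast` for neutral sites). The column likelihood is computed with Felsenstein's pruning algorithm, and the score is the log-likelihood ratio of the slow model to the fast model. A GHMP would combine per-column terms of this kind with the state and length structure of a generalized HMM, which this sketch omits.

```python
import math

BASES = "ACGT"

# Hypothetical tree: an internal node is a list of (child, branch_length) pairs,
# a leaf is a species name. Branch lengths are illustrative, not estimated.
TREE = [([("human", 0.01), ("chimp", 0.01)], 0.02), ("baboon", 0.05)]


def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, rate):
    """Pruning step: P(observed leaves below node | node carries base x), for each x."""
    if isinstance(node, str):  # leaf: likelihood is 1 for the observed base, else 0
        return {x: 1.0 if column[node] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, length in node:
        child_l = partial(child, column, rate)
        for x in BASES:
            result[x] *= sum(
                jc_prob(x, y, rate * length) * child_l[y] for y in BASES
            )
    return result


def column_likelihood(tree, column, rate):
    """Likelihood of one alignment column, with a uniform prior at the root."""
    root = partial(tree, column, rate)
    return sum(0.25 * root[x] for x in BASES)


def shadow_score(tree, column, slow=0.3, fast=1.0):
    """Log-likelihood ratio: positive values favor the slowly evolving model."""
    return math.log(
        column_likelihood(tree, column, slow) / column_likelihood(tree, column, fast)
    )


if __name__ == "__main__":
    conserved = {"human": "A", "chimp": "A", "baboon": "A"}
    variable = {"human": "A", "chimp": "A", "baboon": "G"}
    print("conserved column score:", shadow_score(TREE, conserved))
    print("variable column score:", shadow_score(TREE, variable))
```

Under these assumptions, a column that is identical across species scores positive and a column with a substitution scores negative. Adding more species lengthens the total tree, which is what gives closely related sequences enough combined divergence to separate the two models.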
