Computational identification of evolutionarily conserved exons

Phylogenetic hidden Markov models (phylo-HMMs) have recently been proposed as a means for addressing a multi-species version of the ab initio gene prediction problem. These models allow sequence divergence, a phylogeny, patterns of substitution, and base composition all to be considered simultaneously, in a single unified probabilistic model. Here, we apply phylo-HMMs to a restricted version of the gene prediction problem in which individual exons are sought that are evolutionarily conserved across a diverse set of species. We discuss two new methods for improving prediction performance: (1) the use of context-dependent phylogenetic models, which capture phenomena such as a strong CpG effect in noncoding regions and a preference for synonymous rather than nonsynonymous substitutions in coding regions; and (2) a novel strategy for incorporating insertions and deletion (indels) into the state-transition structure of the model, which captures the different characteristic patterns of alignment gaps in coding and noncoding regions. We also discuss the technique, previously used in pairwise gene predictors, of explicitly modeling conserved noncoding sequence to help reduce false positive predictions. These methods have been incorporated into an exon prediction program called ExoniPhy, and tested with two large data sets. Experimental results indicate that all three methods produce significant improvements in prediction performance. In combination, they lead to prediction accuracy comparable to that of some of the best available gene predictors, despite several limitations of our current models.

[1]  Lior Pachter,et al.  Multiple-sequence functional annotation and the generalized hidden Markov phylogeny , 2004, Bioinform..

[2]  Yun S. Song,et al.  An Efficient Algorithm for Statistical Multiple Alignment on Arbitrary Phylogenetic Trees , 2003, J. Comput. Biol..

[3]  Donna R. Maglott,et al.  RefSeq and LocusLink: NCBI gene-centered resources , 2001, Nucleic Acids Res..

[4]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[5]  David Haussler,et al.  Combining Phylogenetic and Hidden Markov Models in Biosequence Analysis , 2004, J. Comput. Biol..

[6]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[7]  R. Guigó,et al.  Comparative gene prediction in human and mouse. , 2003, Genome research.

[8]  M. Brent,et al.  Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[9]  D. Haussler,et al.  Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. , 2003, Molecular biology and evolution.

[10]  Ziheng Yang Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods , 1994, Journal of Molecular Evolution.

[11]  J. W. Thomas,et al.  Comparative analyses of multi-species sequences from targeted genomic regions , 2003, Nature.

[12]  D Haussler,et al.  The share of human genomic DNA under selection estimated from human-mouse genomic alignments. , 2003, Cold Spring Harbor symposia on quantitative biology.

[13]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[14]  Tom H. Pringle,et al.  The human genome browser at UCSC. , 2002, Genome research.

[15]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[16]  D. Haussler,et al.  Article Identification and Characterization of Multi-Species Conserved Sequences , 2022 .

[17]  M. Brent,et al.  Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. , 2003, Genome research.

[18]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[19]  R. Guigó,et al.  Evaluation of gene structure prediction programs. , 1996, Genomics.

[21]  D. Haussler,et al.  Aligning multiple genomic sequences with the threaded blockset aligner. , 2004, Genome research.

[22]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.

[23]  Ian Holmes,et al.  Evolutionary HMMs: a Bayesian approach to multiple alignment , 2001, Bioinform..

[24]  David C. Jones,et al.  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. , 1996, Journal of molecular biology.

[25]  David Haussler,et al.  Transcriptome and Genome Conservation of Alternative Splicing Events in Humans and Mice , 2003, Pacific Symposium on Biocomputing.

[26]  David Haussler,et al.  Phylogenetic Hidden Markov Models , 2005 .

[27]  J. Felsenstein,et al.  A Hidden Markov Model approach to variation among sites in rate of evolution. , 1996, Molecular biology and evolution.

[28]  Jakob Skou Pedersen,et al.  Gene finding with a hidden Markov model of genome structure and evolution , 2003, Bioinform..

[29]  Z. Yang,et al.  A space-time process model for the evolution of DNA sequences. , 1995, Genetics.

[30]  Jon D. McAuliffe,et al.  Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome , 2003, Science.

[31]  Lior Pachter,et al.  MAVID multiple alignment server , 2003, Nucleic Acids Res..

[32]  Richard Durbin,et al.  Comparative ab initio prediction of gene structures using pair HMMs , 2002, Bioinform..

[33]  W. Fitch Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology , 1971 .

[34]  J. Felsenstein,et al.  An evolutionary model for maximum likelihood alignment of DNA sequences , 1991, Journal of Molecular Evolution.

[35]  Ian Korf,et al.  Integrating genomic homology into gene structure prediction , 2001, ISMB.

[36]  P. Lio’,et al.  Models of molecular evolution and phylogeny. , 1998, Genome research.

[37]  David Haussler,et al.  Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments , 2003, RECOMB '03.

[38]  W. Murphy,et al.  Resolution of the Early Placental Mammal Radiation Using Bayesian Phylogenetics , 2001, Science.

[39]  P. Lio’,et al.  Molecular phylogenetics: state-of-the-art methods for looking into the past. , 2001, Trends in genetics : TIG.

[40]  Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome , 2002, Nature.

[41]  Elizabeth Pennisi,et al.  Gene Counters Struggle to Get the Right Answer , 2003, Science.

[42]  S. Kasif,et al.  Human-mouse gene identification by comparative evidence integration and evolutionary analysis. , 2003, Genome research.

[43]  L. Pachter,et al.  SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. , 2003, Genome research.

[44]  Ian Dunham,et al.  Reevaluating human gene annotation: a second-generation analysis of chromosome 22. , 2003, Genome research.