Identifying novel constrained elements by exploiting biased substitution patterns

MOTIVATION Comparing the genomes from closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods only model conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations. RESULTS We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we identify as much as 10.2% of these regions are under selection. AVAILABILITY The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  A. Halpern,et al.  Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. , 1998, Molecular biology and evolution.

[3]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[4]  I Holmes,et al.  An expectation maximization algorithm for training hidden substitution models. , 2002, Journal of molecular biology.

[5]  Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome , 2002, Nature.

[6]  Tom H. Pringle,et al.  The human genome browser at UCSC. , 2002, Genome research.

[7]  Colin N. Dewey,et al.  Initial sequencing and comparative analysis of the mouse genome. , 2002 .

[8]  D. Haussler,et al.  Article Identification and Characterization of Multi-Species Conserved Sequences , 2022 .

[9]  D. Haussler,et al.  Aligning multiple genomic sequences with the threaded blockset aligner. , 2004, Genome research.

[10]  K. Lindblad-Toh,et al.  Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals , 2005, Nature.

[11]  S. Eddy A Model of the Statistical Power of Comparative Genome Sequence Analysis , 2005, PLoS biology.

[12]  Joaquín Dopazo,et al.  PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes , 2005, Nucleic Acids Res..

[13]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[14]  A. Hobolth,et al.  Statistical Applications in Genetics and Molecular Biology Statistical Inference in Evolutionary Models of DNA Sequences via the EM Algorithm , 2011 .

[15]  S. Batzoglou,et al.  Distribution and intensity of constraint in mammalian genomic sequence. , 2005, Genome research.

[16]  D. Haussler,et al.  Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. , 2005, Genome research.

[17]  Jean L. Chang,et al.  An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[18]  E. Lander,et al.  A large family of ancient repeat elements in the human genome is under strong selection. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[19]  D. Haussler,et al.  A distal enhancer and an ultraconserved exon are derived from a novel retroposon , 2006, Nature.

[20]  E. Lander,et al.  A family of conserved noncoding elements derived from an ancient transposable element. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Lior Pachter,et al.  Phylogenetic Profiling of Insertions and Deletions in Vertebrate Genomes , 2006, RECOMB.

[22]  Colin N. Dewey,et al.  Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures , 2007, Nature.

[23]  Daniel J. Blankenberg,et al.  28-way vertebrate alignment and conservation track in the UCSC Genome Browser. , 2007, Genome research.

[24]  P. Green 2x genomes--does depth matter? , 2007, Genome research.

[25]  Mikhail A. Roytberg,et al.  Analysis of Sequence Conservation at Nucleotide Resolution , 2007, PLoS Comput. Biol..

[26]  Colin N. Dewey,et al.  Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. , 2007, Genome research.

[27]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[28]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[29]  Ziheng Yang PAML 4: phylogenetic analysis by maximum likelihood. , 2007, Molecular biology and evolution.

[30]  Mathieu Blanchette,et al.  Exact and Heuristic Algorithms for the Indel Maximum Likelihood Problem , 2007, J. Comput. Biol..

[31]  E. Birney,et al.  Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes , 2008, Nature Reviews Genetics.

[32]  Elena Rivas,et al.  Probabilistic Phylogenetic Inference with Insertions and Deletions , 2008, PLoS Comput. Biol..

[33]  Michael F. Lin,et al.  Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals , 2009, Nature.