Paper information - Identifying novel constrained elements by exploiting biased substitution patterns

Motivation: Comparing the genomes of closely related species provides a powerful tool for identifying functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods model conservation only as a decrease in the rate of mutation and ignore selection acting on the pattern of mutations.

Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments, and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we find that as much as 10.2% of these regions are under selection.

Availability: The algorithms are implemented in a Java software package called SiPhy, freely available at http://www.broadinstitute.org/science/software/.

Supplementary information: Supplementary data are available at Bioinformatics online.
