Dinucleotide controlled null models for comparative RNA gene prediction

Background

Comparative prediction of RNA structures can be used to identify functional noncoding RNAs in genomic screens. Babak et al. [BMC Bioinformatics. 8:33] recently showed that RNA gene prediction programs can be biased by the genomic dinucleotide content, in particular those programs that use a thermodynamic folding model including stacking energies. As a consequence, there is a need for dinucleotide-preserving control strategies to assess the significance of such predictions. While randomization algorithms for single sequences have existed for many years, the problem has remained challenging for multiple alignments, and no algorithm is currently available.

Results

We present a program called SISSIz that simulates multiple alignments of a given average dinucleotide content. To meet the additional requirements of an accurate null model, the randomized alignments have, on average, the same sequence diversity as the original and preserve its local conservation and gap patterns. We make use of a phylogenetic substitution model that includes overlapping dependencies and site-specific rates. Using fast heuristics and a distance-based approach, a tree is estimated under this model and then used to guide the simulations. The new algorithm is tested on vertebrate genomic alignments, and its effect on RNA structure predictions is studied. In addition, we directly combined the new null model with the RNAalifold consensus folding algorithm, giving a new variant of a thermodynamic structure-based RNA gene finding program that is not biased by the dinucleotide content.

Conclusion

SISSIz implements an efficient algorithm to randomize multiple alignments while preserving dinucleotide content. It can be used to obtain more accurate estimates of the false positive rates of existing programs, to produce negative controls for the training of machine-learning-based programs, or as a standalone RNA gene finding program. Other applications in comparative genomics that require randomization of multiple alignments can also be considered.

Availability

SISSIz is available as open source C code that can be compiled for every major platform and downloaded from http://sourceforge.net/projects/sissiz.
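To make concrete the quantity that a dinucleotide-controlled null model is meant to preserve, the following minimal Python sketch computes the average dinucleotide frequencies of a multiple alignment and compares two alignments. It is an illustration only, not the SISSIz implementation: the function names `dinucleotide_frequencies` and `max_frequency_difference` are hypothetical, and the treatment of gaps (removing them from each row before counting adjacent pairs) is an assumption made here for simplicity.

```python
from collections import Counter
from itertools import product


def dinucleotide_frequencies(alignment, alphabet="ACGT", gap_chars="-."):
    """Return relative dinucleotide frequencies pooled over all rows.

    Assumption (not taken from the paper): gaps are removed from each
    row before adjacent nucleotide pairs are counted, and pairs that
    contain a symbol outside `alphabet` (e.g. N) are skipped.
    """
    counts = Counter()
    for row in alignment:
        seq = "".join(ch for ch in row.upper() if ch not in gap_chars)
        for a, b in zip(seq, seq[1:]):
            if a in alphabet and b in alphabet:
                counts[a + b] += 1
    total = sum(counts.values())
    # Report all 16 dinucleotides, including those that never occur.
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(alphabet, repeat=2)
    }


def max_frequency_difference(freqs_a, freqs_b):
    """Largest absolute difference between two frequency tables."""
    return max(abs(freqs_a[k] - freqs_b[k]) for k in freqs_a)


if __name__ == "__main__":
    original = ["ACGT-ACG", "ACGTTACG"]
    freqs = dinucleotide_frequencies(original)
    for dinucleotide, freq in sorted(freqs.items()):
        if freq > 0:
            print(dinucleotide, round(freq, 3))
```

A randomized alignment could be checked against its original by passing both frequency tables to `max_frequency_difference`; a value close to zero indicates that the dinucleotide content is approximately retained. Because the null model described above preserves dinucleotide content on average, such a check is more informative when averaged over many randomized replicates than for a single one.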
