Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change

Background
Non-coding RNAs (ncRNAs) have a multitude of roles in the cell, many of which remain to be discovered. However, it is difficult to detect novel ncRNAs in biochemical screens. To advance biological knowledge, computational methods that can accurately detect ncRNAs in sequenced genomes are therefore desirable. The increasing number of genomic sequences provides a rich dataset for computational comparative sequence analysis and detection of novel ncRNAs.

Results
Here, Dynalign, a program for predicting secondary structures common to two RNA sequences on the basis of minimizing folding free energy change, is utilized as a computational ncRNA detection tool. The Dynalign-computed optimal total free energy change, which scores the structural alignment and the free energy change of folding into a common structure for two RNA sequences, is shown to be an effective measure for distinguishing ncRNAs from randomized sequences. To classify an input sequence pair as an ncRNA, its total free energy change can either be compared with the total free energy changes of a set of control sequence pairs, or be used in combination with sequence length and nucleotide frequencies as input to a classification support vector machine. The latter method is much faster, but slightly less sensitive at a given specificity. Additionally, the classification support vector machine method is shown to be sensitive and specific on genomic ncRNA screens of two different Escherichia coli and Salmonella typhi genome alignments, in which many ncRNAs are known. The Dynalign computational experiments are also compared with two other ncRNA detection programs, RNAz and QRNA.

Conclusion
The Dynalign-based support vector machine method is more sensitive for known ncRNAs in the test genomic screens than RNAz and QRNA. Additionally, both Dynalign-based methods are more sensitive than RNAz and QRNA at low sequence pair identities. In genomic screens, Dynalign is therefore comparable to or more accurate than RNAz and QRNA, especially for low-identity regions, and it provides a method for discovering ncRNAs in sequenced genomes that other methods may not identify. Significant improvements in Dynalign runtime have also been achieved.
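The two classification routes described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of a z-score to compare an input pair against its control pairs, the cutoff of -2.0, and the exact layout of the feature vector are all assumptions made for illustration, and the free energy values are taken as given (for example, from a Dynalign run) rather than computed here.

```python
from collections import Counter
from statistics import mean, stdev


def z_score(dg_pair, dg_controls):
    """Standard score of the input pair's total free energy change
    relative to the control pairs (hypothetical comparison statistic)."""
    if len(dg_controls) < 2:
        raise ValueError("need at least two control values")
    sd = stdev(dg_controls)
    if sd == 0:
        raise ValueError("control values have zero variance")
    return (dg_pair - mean(dg_controls)) / sd


def looks_structured(dg_pair, dg_controls, z_cutoff=-2.0):
    """Route 1: call the pair a candidate ncRNA when its free energy
    change is much lower than that of the controls.
    The cutoff is an illustrative assumption, not a published value."""
    return z_score(dg_pair, dg_controls) <= z_cutoff


def nucleotide_frequencies(seq):
    """Fractions of A, C, G, U in a sequence (T is treated as U)."""
    counts = Counter(seq.upper().replace("T", "U"))
    total = sum(counts[base] for base in "ACGU")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/U characters")
    return [counts[base] / total for base in "ACGU"]


def svm_features(dg_pair, seq1, seq2):
    """Route 2: assemble a feature vector (free energy change, lengths,
    nucleotide frequencies) that could be passed to an SVM classifier.
    The ordering of the features is an assumption of this sketch."""
    return (
        [dg_pair, len(seq1), len(seq2)]
        + nucleotide_frequencies(seq1)
        + nucleotide_frequencies(seq2)
    )
```

The first route needs a structural alignment for every control pair, whereas the second needs only one alignment per input pair plus a trained classifier, which is consistent with the abstract's remark that the support vector machine method is much faster.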
