RSEARCH: Finding homologs of single structured RNA sequences

Background

For many RNA molecules, secondary structure rather than primary sequence is the evolutionarily conserved feature. No programs have yet been published that allow searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure.

Results

We have developed a program, RSEARCH, that takes a single RNA sequence with its secondary structure and utilizes a local alignment algorithm to search a database for homologous RNAs. For this purpose, we have developed a series of base pair and single nucleotide substitution matrices for RNA sequences called RIBOSUM matrices. RSEARCH reports the statistical confidence for each hit as well as the structural alignment of the hit. We show several examples in which RSEARCH outperforms the primary sequence search programs BLAST and SSEARCH. The primary drawback of the program is that it is slow. The C code for RSEARCH is freely available from our lab's website.

Conclusion

RSEARCH outperforms primary sequence programs in finding homologs of structured RNA sequences.
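
The idea of a substitution matrix for base pairs can be illustrated with a minimal sketch. It assumes, by analogy with BLOSUM-style matrices, that a score is the base-2 log-odds ratio of how often two symbols are observed aligned versus how often they would be aligned by chance. The function name `log_odds_matrix` and the toy counts are hypothetical and are not taken from RSEARCH or from the RIBOSUM matrices themselves.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Build a symmetric log2-odds score table from observed aligned symbols.

    Each symbol may be a single nucleotide ("A") or a base pair ("GC").
    This is an illustrative assumption, not the published RIBOSUM procedure.
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count every observation in both orders so the table is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

    total = sum(pair_counts.values())
    # Joint frequency of each ordered aligned pair of symbols.
    joint = {key: count / total for key, count in pair_counts.items()}

    # Background frequency of each symbol, obtained as the marginal of the joint.
    background = Counter()
    for (a, _), freq in joint.items():
        background[a] += freq

    return {
        (a, b): math.log2(freq / (background[a] * background[b]))
        for (a, b), freq in joint.items()
    }


# Hypothetical toy data: base pairs seen at homologous positions of two RNAs.
toy_alignment = (
    [("GC", "GC")] * 6    # pair conserved unchanged
    + [("GC", "AU")] * 3  # pair replaced by a different Watson-Crick pair
    + [("AU", "AU")] * 1
)
scores = log_odds_matrix(toy_alignment)
for (a, b), score in sorted(scores.items()):
    print(f"{a} -> {b}: {score:+.2f} bits")
```

Treating each base pair as a single symbol lets the two positions of a pair be scored jointly rather than independently, which is what distinguishes a base pair matrix from a single nucleotide matrix in this sketch.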
