A Computational Approach to Search for Non-Coding RNAs in Large Genomic Data

Over the last few years several specialized software tools have been developed, each allowing a certain class of RNAs to be found in sequence data. Here we describe a general tool that allows us to specify many different non-coding RNAs and structural RNA elements by a simple pattern description language. To take into account that RNA is normally conserved in structure as well as in sequence, the pattern description language combines methods to describe sequence and structural similarities as well as further characteristics, e.g., thermodynamic constraints. Structure- and sequence-based patterns describing certain classes of RNAs are collected in a web-based pattern library. These include simple patterns, e.g., describing extrastable tetraloops and small regulatory stem-loop structures, as well as more complex patterns, for example describing pseudoknots, ribozymes, SRP RNAs, 5S RNA and selenocysteine insertion sequences. A web-based service allows a user to search for patterns from the library in sequences given by the user. Alternatively, the user can specify a pattern that is searched for in public genomic sequence data. Here we give a comprehensive introduction to the pattern language, describe how to systematically derive pattern descriptions, and show some results on purine riboswitches obtained using this computational approach.
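To make the idea of a combined sequence and structure pattern concrete, the following is a minimal sketch in Python. It is not the pattern description language or the search tool described here. It assumes a fixed-length stem closed by a GNRA tetraloop, allows G-U wobble pairs, and ignores thermodynamic constraints. The function name `find_gnra_hairpins` and its parameter `stem_len` are hypothetical and serve only as illustration.

```python
import re

# Watson-Crick and G-U wobble pairs allowed in the stem (illustrative choice).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# GNRA tetraloop in IUPAC terms: N = any base, R = purine (A or G).
LOOP = re.compile(r"G[ACGU][AG]A")


def find_gnra_hairpins(seq, stem_len=4):
    """Return (start, stem5, loop, stem3) for each stem + GNRA loop + stem hit."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA input
    total = 2 * stem_len + 4             # full length of one hairpin candidate
    hits = []
    for i in range(len(seq) - total + 1):
        stem5 = seq[i:i + stem_len]
        loop = seq[i + stem_len:i + stem_len + 4]
        stem3 = seq[i + stem_len + 4:i + total]
        # Sequence constraint: the loop must match the GNRA consensus.
        if not LOOP.fullmatch(loop):
            continue
        # Structure constraint: the 3' strand must pair with the 5' strand
        # when read in reverse.
        if all((a, b) in PAIRS for a, b in zip(stem5, reversed(stem3))):
            hits.append((i, stem5, loop, stem3))
    return hits


if __name__ == "__main__":
    for hit in find_gnra_hairpins("AAGGCCGAAAGGCCAA"):
        print(hit)
```

The sketch scans every window of fixed length, which is adequate for short sequences only. A search over large genomic data would need variable stem and loop lengths, mismatch tolerance, and an index-based search strategy, none of which is attempted here.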
