RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure

Background

With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none is suitable for compressing RNA sequences together with their secondary structures. Such compression not only facilitates the maintenance of RNA data but also provides a novel way to measure the informational complexity of RNA structural data. This makes it possible to study the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA, on the basis of compression.

Results

RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The algorithm has two main goals: (1) to present a robust and effective method for compressing RNA structural data; and (2) to design a suitable model that represents RNA secondary structure and derives the informational complexity of the structural data from its compression. Our extensive tests show that RNACompress achieves a universally better compression ratio than other sequence-specific or general-purpose text compression tools, such as GenCompress, WinRAR and gzip. Moreover, a comparison of the activities of distinct GTP-binding RNAs (aptamers) with their structural complexity shows that the informational complexity defined here can be used to describe how complexity varies with activity. These results provide an objective means of comparing the functional properties of heteropolymers from an information perspective.

Conclusion

This paper discusses a universal algorithm for compressing RNA secondary structure and evaluating its informational complexity. We have developed RNACompress as a tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for compressing RNA sequences together with their secondary structures. RNACompress also serves as a good measure of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
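
The idea of reading informational complexity off a compressed size can be illustrated with a minimal sketch. It does not implement the grammar-based model of RNACompress: it substitutes the general-purpose DEFLATE compressor from Python's `zlib` module as a stand-in, and it assumes the secondary structure is given in dot-bracket notation. The function name `bits_per_base` and the example hairpin are hypothetical and chosen only for illustration.

```python
import zlib


def bits_per_base(sequence: str, structure: str = "") -> float:
    """Compressed size in bits divided by the number of bases.

    Assumption: zlib (DEFLATE) stands in for a real RNA compressor, so the
    value is only a rough upper bound on the information content.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    if structure and len(structure) != len(sequence):
        raise ValueError("structure must have one symbol per base")
    # Sequence and (optional) dot-bracket structure are compressed together.
    payload = (sequence + "\n" + structure).encode("ascii")
    compressed = zlib.compress(payload, 9)
    return 8 * len(compressed) / len(sequence)


if __name__ == "__main__":
    # Hypothetical hairpin: 6 base pairs closing a 4-base loop.
    seq = "GGGAAACUUCGGUUUCCC"
    dot_bracket = "((((((.((..)).))))))"[: len(seq)]
    # Use a structure of matching length for the illustration.
    dot_bracket = "(((((((....)))))))"
    print(f"{bits_per_base(seq, dot_bracket):.2f} bits per base")
```

For inputs this short, the fixed overhead of the zlib container dominates, so the number is meaningful only when comparing inputs of similar length; a dedicated model such as the one described in the abstract avoids that overhead.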
