An Approach to Selecting Putative RNA Motifs Using MDL Principle

The history of molecular biology is punctuated by a series of discoveries demonstrating the surprising breadth of biological roles of ribonucleic acid (RNA). An ensemble of evolutionarily related RNA sequences believed to contain signals at the sequence and structure level can be exploited to detect motifs common to all or a portion of those sequences. Finding these shared structural features can provide substantial information about which parts of the sequence are functional. For several decades, free energy minimization has been the most popular method for structure prediction. However, the limitations of free energy models, as well as their time complexity, have prompted us to look for alternative approaches. We therefore investigate another paradigm, minimum description length (MDL) encoding, for evaluating the significance of consensus motifs. Here, we evaluate motifs generated by Seed using the description length as a selection criterion. The MDL scoring method was tested on four data sets of varying complexity. We found that the scoring method produces structures that are competitive with those predicted at lowest free energy. The top-ranked motifs have high positive predictive value with respect to known motifs.
