Design and implementation of a cyberinfrastructure for RNA motif search, prediction and analysis

by Dongrong Wen

RNA secondary and tertiary structure motifs play important roles in cells. However, very few web servers are available for RNA motif search and prediction. In this dissertation, a cyberinfrastructure named RNAcyber, capable of performing RNA motif search and prediction, is proposed, designed and implemented. The first component of RNAcyber is a web-based search engine named RmotifDB. This tool integrates an RNA secondary structure comparison algorithm with the secondary structure motifs stored in the Rfam database. Through a user-friendly interface, RmotifDB lets users search for ncRNA structure motifs by structure as well as by sequence. The second component of RNAcyber is an enhanced version of RmotifDB. This enhanced version combines data from multiple sources, incorporates a variety of well-established structure-based search methods, and is integrated with the Gene Ontology. To display RmotifDB's search results graphically, a software tool called RSview has been developed. Finally, RNAcyber contains a web-based tool called Junction-Explorer, which employs a data mining method for predicting tertiary motifs in RNA junctions. Specifically, the tool is trained on solved RNA tertiary structures obtained from the Protein Data Bank, and is able to predict the configuration of coaxial helical stacks and the families (topologies) of RNA junctions at the secondary structure level. Junction-Explorer employs several algorithms for motif prediction, including a random forest classification algorithm, a pseudoknot removal algorithm, and a feature ranking algorithm based on the Gini impurity measure. A series of experiments, including 10-fold cross-validation, has been conducted to evaluate the performance of Junction-Explorer. Experimental results demonstrate the effectiveness of the proposed algorithms and the superiority of the tool over existing methods. The RNAcyber infrastructure is fully operational, with all of its components accessible on the Internet.
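To illustrate the Gini impurity measure mentioned above, the following is a minimal sketch of how the impurity of a set of class labels, and the impurity decrease produced by a single threshold split, can be computed. It is not the Junction-Explorer implementation; the function names, the feature values, and the "stacked"/"unstacked" labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(feature, labels, threshold):
    """Drop in Gini impurity when samples are split on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical toy data: one numeric feature per junction and a class label.
feature = [1, 2, 3, 4]
labels = ["stacked", "stacked", "unstacked", "unstacked"]
decrease = impurity_decrease(feature, labels, threshold=2)
```

In a random forest, decreases of this kind are typically accumulated over all splits that use a given feature and averaged over the trees, which yields a ranking of features by importance.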
