A Tool Preference Selection Method for RNA Secondary Structure Prediction with SVM

Prediction of RNA secondary structures has drawn much attention from both biologists and computer scientists. Many useful tools have been developed for this purpose, some handling pseudoknots and some not. Each tool has its own strengths and weaknesses. We therefore propose a tool preference selection method that integrates three prediction tools (pknotsRG, RNAstructure, and NUPACK) using support vector machines (SVMs). Our method first extracts features from the target RNA sequences and then adopts an information-theoretic feature selection method to rank them. We also propose incremental mRMR, a method that combines feature selection with classifier fusion. The test data set contains 720 RNA sequences: 225 pseudoknotted RNA sequences obtained from PseudoBase and 495 nested RNA sequences obtained from RNA SSTRAND. Our method serves as a preprocessing step for analyzing RNA sequences before the RNA secondary structure prediction tools are applied. Experimental results show that our method improves both the classification accuracy and the base-pair accuracies.
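The feature-ranking step can be illustrated with a minimal sketch of standard greedy mRMR (max-relevance, min-redundancy) in its difference form. This is not the incremental mRMR variant proposed here, whose details the abstract does not give. The feature names (`gc_bin`, `gc_bin_copy`, `length_bin`), their discretized values, and the assignment of preferred tools to the eight toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels, k):
    """Greedy mRMR: repeatedly pick the feature maximizing
    relevance I(f; labels) minus mean redundancy I(f; s) over selected s."""
    remaining = list(features)
    relevance = {f: mutual_information(features[f], labels) for f in remaining}
    selected = []

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: eight sequences, each labeled with the tool
# assumed to predict it best, plus three made-up discretized features.
labels = ["pknotsRG"] * 4 + ["RNAstructure"] * 2 + ["NUPACK"] * 2
features = {
    "gc_bin":      [0, 0, 0, 0, 1, 1, 1, 1],
    "gc_bin_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # deliberately redundant
    "length_bin":  [0, 0, 1, 1, 0, 0, 1, 1],
}

ranking = mrmr_rank(features, labels, k=3)
print(ranking)
```

On this toy input the duplicated column is ranked last: it is as relevant as `gc_bin`, but its redundancy with the already selected `gc_bin` cancels that relevance. In a pipeline like the one described above, the top-ranked features would then be passed to an SVM that predicts the preferred tool for each sequence.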
