Robust and accurate prediction of noncoding RNAs from aligned sequences

Background: Computational prediction of noncoding RNAs (ncRNAs) is an important task in the post-genomic era. One common approach is to use the profile information contained in alignment data rather than single sequences. However, with this strategy the quality of the input alignments can influence the performance of prediction methods. Evaluating robustness against alignment errors is therefore as necessary as developing accurate prediction methods.

Results: We describe a new method, called Profile BPLA kernel, which predicts ncRNAs from alignment data in combination with support vector machines (SVMs). Profile BPLA kernel is an extension of the base-pairing profile local alignment (BPLA) kernel, which we previously developed for prediction from single sequences. By using the profile information in alignment data, the proposed kernel can achieve better accuracy than the original BPLA kernel. On a high-quality structural alignment dataset, we show that Profile BPLA kernel outperforms existing prediction methods that also use profile information. In addition to these standard benchmark tests, we extensively evaluate the robustness of Profile BPLA kernel against errors in input alignments. We consider two types of error: first, all sequences in an alignment are actually ncRNAs but are aligned without regard to their secondary structures; second, an alignment contains unrelated sequences that are not ncRNAs but are still aligned. In both cases, the effects on the performance of Profile BPLA kernel are surprisingly small. For the latter case in particular, we demonstrate that Profile BPLA kernel is more robust than the existing prediction methods.

Conclusions: Profile BPLA kernel provides a promising way to identify ncRNAs from alignment data. It is more accurate than the existing prediction methods and maintains its performance in practical situations where the quality of input alignments is not necessarily high.
