PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme

Background: High-throughput transcriptome sequencing (RNA-seq) technology promises to reveal novel protein-coding and non-coding transcripts, and in particular to identify long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that do not depend on prior gene annotations, genomic sequences, or high-quality sequencing.

Results: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. Ten-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that the tool could achieve an accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets; PLEK attained >90% accuracy on most of these datasets. PLEK also performed well on a simulated dataset and on two real de novo assembled transcriptome datasets (sequenced on the PacBio and 454 platforms) with relatively high rates of indel sequencing errors. In addition, when run single-threaded, PLEK is approximately eightfold faster than a newly developed alignment-free tool, Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC).

Conclusions: PLEK is an efficient alignment-free computational tool for distinguishing lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and for large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/.
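To make the alignment-free idea concrete, the following is a minimal sketch of plain k-mer frequency extraction, the kind of sequence-only feature vector that an SVM classifier could be trained on. It is not PLEK's implementation: the abstract does not specify how the "improved" k-mer scheme differs from plain counting, so the function name `kmer_frequencies`, the default `k_max=3`, the per-k normalization, and the handling of non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_frequencies(seq, k_max=3):
    """Map every k-mer (k = 1..k_max) to its relative frequency in seq.

    Hypothetical helper for illustration; plain sliding-window counting,
    not the improved k-mer scheme used by PLEK.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    features = {}
    for k in range(1, k_max + 1):
        # Slide a window of width k over the sequence with step 1.
        windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
        # Skip windows containing ambiguous characters such as N.
        counts = Counter(w for w in windows if set(w) <= set(ALPHABET))
        total = sum(counts.values())
        # Emit all 4**k possible k-mers so every sequence yields a
        # feature vector of the same length and ordering.
        for kmer in ("".join(p) for p in product(ALPHABET, repeat=k)):
            features[kmer] = counts[kmer] / total if total else 0.0
    return features


if __name__ == "__main__":
    feats = kmer_frequencies("ACGTACGT", k_max=2)
    print(len(feats), feats["A"], feats["AC"])
```

Because the features depend only on the transcript sequence itself, no genome alignment or annotation is needed, which is the property the abstract emphasizes. The resulting fixed-length vectors could then be passed to any SVM implementation for training on labeled mRNA and lncRNA transcripts.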
