LncPred-IEL: A Long Non-coding RNA Prediction Method using Iterative Ensemble Learning

High-throughput sequencing technologies have generated a large number of transcripts, and identifying long non-coding RNAs (lncRNAs) among them is a challenging and important task. In this paper, we propose LncPred-IEL, an lncRNA prediction method based on iterative ensemble learning. LncPred-IEL considers not only features widely used for lncRNA prediction but also sequence-derived features used in RNA sequence classification, so as to make use of diverse information. LncPred-IEL builds base predictors on different groups of features and combines them in a supervised iterative manner to build ensemble models. Our studies demonstrate that this supervised iterative procedure learns representations that help separate lncRNAs from protein-coding transcripts and further improves performance. Experiments demonstrate that LncPred-IEL outperforms several state-of-the-art methods when evaluated by 10-fold cross-validation. We also test the capability of LncPred-IEL for cross-species prediction. As a complement to wet-lab experiments, LncPred-IEL is a useful computational tool for lncRNA prediction.
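The general idea of building one base predictor per feature group and feeding their outputs back in over several supervised rounds can be illustrated with a small sketch. This is not the LncPred-IEL implementation: the abstract does not specify the base learners, the feature groups, or the number of iterations, so the nearest-centroid scorer, the group names `kmer` and `orf`, the toy data, and `rounds=2` are all assumptions chosen to keep the example dependency-free.

```python
import math
import random

def fit_centroids(X, y):
    """Mean feature vector per class (0 = protein-coding, 1 = lncRNA)."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def score(cents, x):
    """Score in (0, 1); higher means closer to the lncRNA centroid."""
    z = math.dist(x, cents[1]) - math.dist(x, cents[0])
    z = max(-50.0, min(50.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(z))

def iterative_ensemble(groups, y, rounds=2):
    """Train one base predictor per feature group in each round.

    groups: dict mapping a (hypothetical) group name to a list of feature
    vectors, one per transcript. From the second round on, every group's
    features are extended with all base-predictor scores of the previous
    round, so later rounds learn from the earlier outputs.
    """
    prev = [[] for _ in y]  # previous-round scores, empty in round 1
    models = []
    for _ in range(rounds):
        layer, new_scores = {}, [[] for _ in y]
        for name, X in sorted(groups.items()):
            Xa = [x + p for x, p in zip(X, prev)]
            layer[name] = fit_centroids(Xa, y)
            for i, xa in enumerate(Xa):
                new_scores[i].append(score(layer[name], xa))
        models.append(layer)
        prev = new_scores
    return models

def predict(models, sample):
    """Average the last round's base-predictor scores for one transcript."""
    prev = []
    for layer in models:
        prev = [score(layer[name], sample[name] + prev) for name in sorted(layer)]
    return sum(prev) / len(prev)

# Toy data (assumption): two synthetic feature groups, 20 transcripts per class.
random.seed(0)
y = [0] * 20 + [1] * 20
groups = {
    "kmer": [[random.gauss(t, 0.5), random.gauss(-t, 0.5)] for t in y],
    "orf": [[random.gauss(2 * t, 0.5)] for t in y],
}
models = iterative_ensemble(groups, y)
p_lnc = predict(models, {"kmer": [1.0, -1.0], "orf": [2.0]})
p_coding = predict(models, {"kmer": [0.0, 0.0], "orf": [0.0]})
```

For brevity the sketch reuses training-set scores as inputs to the next round. A practical pipeline would more likely generate those scores out-of-fold to limit overfitting, and would evaluate the final model with a held-out scheme such as the 10-fold cross-validation mentioned above.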
