A tf-idf based topic model for identifying lncRNAs from genomic background

Advances in high-throughput technologies have identified a large number of long non-coding RNAs (lncRNAs) whose functional characterization remains an open problem. Existing research shows that lncRNAs play a major role in genetic and epigenetic regulation, and that their expression levels are significantly associated with complex diseases such as cancers. Identifying lncRNAs and characterizing their functions is therefore an important task in RNA bioinformatics. Despite their abundance in the cell, lncRNAs are poorly conserved at the sequence level, which makes their analysis challenging. Many machine learning based models have been developed for the identification and analysis of lncRNAs. This paper proposes a topic model based method for identifying lncRNAs. To investigate the applicability of topic models to lncRNA analysis, this work develops an LDA based topic model to group lncRNAs from a collection of transcriptome sequences. The topic model uses features derived from transformed k-mer patterns and from the secondary structure of lncRNA sequences. The results are promising compared with classic algorithms and suggest that topic models are a reasonable approach to lncRNA analysis.
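As an illustration of the kind of tf-idf transformation of k-mer patterns named in the title, the following minimal sketch treats each sequence as a document and each overlapping k-mer as a word. It is not the paper's implementation: the function names, the choice of k = 3, the T-to-U normalization, and the specific weighting (relative term frequency times log(N / document frequency)) are all assumptions made for the example.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    seq = seq.upper().replace("T", "U")  # assumption: normalize DNA to RNA letters
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: tf-idf weight} dict per input sequence.

    tf  = count of the k-mer / number of k-mers in the sequence
    idf = log(number of sequences / number of sequences containing the k-mer)
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)

    # Document frequency: in how many sequences does each k-mer occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in doc.items()
        })
    return weights


if __name__ == "__main__":
    # Toy sequences, made up for the example.
    toy = ["AUGGCUAUG", "AUGCCCAUG", "GGGGGGGGG"]
    for seq, w in zip(toy, tfidf(toy)):
        top = sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(seq, top)
```

With this weighting, a k-mer that occurs in every sequence receives weight zero, so only k-mers that discriminate between sequences contribute. The resulting weights would then need to be arranged into a document-term matrix before being passed to an LDA implementation.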