Prediction of Human LncRNAs Based on Integrated Information Entropy Features

The prediction of long non-coding RNAs (lncRNAs) has attracted great attention from researchers, as growing evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of big biological data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of existing lncRNA sequence resources. We present a lncRNA prediction method that integrates information-entropy-based features with machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. Using these 6 features together with other features such as open reading frame (ORF), we apply Support Vector Machine (SVM), XGBoost and Random Forest (RF) algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher Area Under the Curve (AUC), up to 99.7905%. The result is an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. The method can also be extended to research on other functional elements in DNA sequences.
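To make the idea of entropy-based sequence features concrete, the following is a minimal sketch and not the authors' implementation. It computes plain Shannon entropy over k-mer frequencies as a stand-in for the generalized topological entropy features, plus a simple forward-strand ORF length feature. The function names, the choice of k values, and the ORF definition (ATG to the first in-frame stop codon) are all assumptions made for illustration.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_shannon_entropy(seq, k):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq.

    Assumption: this is ordinary Shannon entropy, used here only as a simple
    stand-in for the generalized topological entropy named in the abstract.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF on the forward
    strand, stop codon included. Hypothetical, simplified ORF definition."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # close the ORF and look for the next ATG
    return best


def feature_vector(seq, ks=(1, 2, 3)):
    """Entropy for each k in ks, followed by the ORF coverage fraction."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    n = max(len(seq), 1)
    entropies = [kmer_shannon_entropy(seq, k) for k in ks]
    return entropies + [longest_orf_length(seq) / n]
```

A vector such as `feature_vector(transcript)` could then be passed, one row per transcript, to any of the classifiers mentioned above (SVM, XGBoost or Random Forest). The actual six features and the ORF handling used in the paper may differ from this sketch.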
