LncLocation: Efficient Subcellular Location Prediction of Long Non-Coding RNA-Based Multi-Source Heterogeneous Feature Fusion

Recent studies show that the subcellular location of long non-coding RNAs (lncRNAs) provides significant information about their function. Because experimental data are scarce, the number of lncRNAs with experimentally verified subcellular localization is very limited, and the numbers of lncRNAs located in different organelles are highly imbalanced. Predicting the subcellular location of lncRNAs is therefore a small-sample, imbalanced multi-class classification problem. The data imbalance causes machine learning models to recognize the small classes poorly, which remains a challenging problem in existing research. In this study, we integrate multi-source features to construct a sequence-based computational tool, lncLocation, to predict the subcellular location of lncRNAs. An autoencoder is used to enhance some of the features, and a binomial distribution-based filtering method and recursive feature elimination (RFE) are used to filter others. This improves the representational power of the data and mitigates the imbalanced multi-class problem. Through comprehensive experiments on different feature combinations and machine learning models, we select the optimal feature set and classifier to build lncLocation. LncLocation achieves 87.78% accuracy under 5-fold cross-validation on the benchmark data, which is higher than state-of-the-art tools, and classification performance improves significantly, especially for the small classes.
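The abstract does not state the formula behind the binomial distribution-based filter, so the following is only a minimal sketch under an assumed formulation: for a feature (for example a k-mer) that occurs n times in one class out of N times overall, the score is one minus the binomial tail probability of seeing at least n occurrences by chance, given the class prior q. The function names, the toy counts, and the class labels are all hypothetical and are not taken from lncLocation.

```python
from math import comb

def binomial_confidence(counts):
    """counts[c][f] = occurrences of feature f in class c (toy input format).

    Returns conf[f][c] = 1 - P(X >= n_fc) with X ~ Binomial(N_f, q_c), where
    N_f is the total count of feature f and q_c is the share of all feature
    occurrences that fall in class c. This formulation is an assumption.
    """
    total = sum(sum(row.values()) for row in counts.values())
    prior = {c: sum(row.values()) / total for c, row in counts.items()}
    features = sorted({f for row in counts.values() for f in row})
    conf = {}
    for f in features:
        n_f = sum(row.get(f, 0) for row in counts.values())
        conf[f] = {}
        for c, row in counts.items():
            n_fc = row.get(f, 0)
            q = prior[c]
            # Upper tail: probability of at least n_fc occurrences by chance.
            tail = sum(
                comb(n_f, m) * q**m * (1 - q) ** (n_f - m)
                for m in range(n_fc, n_f + 1)
            )
            # Clamp to guard against tiny negative values from rounding.
            conf[f][c] = max(0.0, 1.0 - tail)
    return conf

def rank_features(counts):
    """Order features by their highest confidence across classes."""
    conf = binomial_confidence(counts)
    best = {f: max(by_class.values()) for f, by_class in conf.items()}
    return sorted(best, key=lambda f: best[f], reverse=True)

# Hypothetical k-mer counts in two imbalanced classes.
toy_counts = {
    "nucleus": {"AAA": 30, "CGC": 10, "TTT": 20},
    "cytosol": {"AAA": 5, "CGC": 10, "TTT": 25},
}
conf = binomial_confidence(toy_counts)
ranking = rank_features(toy_counts)
print(ranking)
```

In a pipeline of the kind the abstract describes, the top-ranked features from such a filter would be kept and the remainder discarded before a wrapper method such as RFE refines the set further; the cut-off would need to be tuned by cross-validation.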
