Pan-cancer Feature Selection and Classification Reveals Important Long Non-coding RNAs

Long non-coding RNAs (lncRNAs) play an important role in altering the expression profiles of various target genes, which can lead to cancer development. Identifying the key lncRNAs related to the origin of different types of cancer might therefore help in developing cancer therapies. To discover the critical lncRNAs that can identify the origin of different cancers, we propose using a state-of-the-art deep learning algorithm, the Concrete Autoencoder (CAE). The motivation for using the CAE is that it combines the advantages of the autoencoder (AE), which can achieve the highest classification accuracy, with concrete relaxation-based feature selection, which selects actual features instead of latent features. To benchmark the CAE, we used three frequently used embedded feature selection techniques: Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF), and Support Vector Machine with Recursive Feature Elimination (SVM-RFE). To obtain a stable set of lncRNAs capable of identifying the origin of 33 different cancers, a lncRNA selected by at least two of the four techniques (CAE, LASSO, RF, and SVM-RFE) was added to the final list of key lncRNAs. The genome-wide lncRNA expression profiles of 33 different types of cancer, a total of 9566 samples, available in The Cancer Genome Atlas (TCGA), were analyzed to discover the key lncRNAs. Our results show that CAE performs better at feature selection than LASSO, RF, and SVM-RFE, especially when selecting a small number of features. As the number of selected features increases from 10 to 500 lncRNAs, the accuracy of each approach increases as follows: CAE, 70% to 96%; LASSO, 55% to 94%; RF, 38% to 95%; SVM-RFE, 50% to 94%. This study discovered a set of 69 lncRNAs that can identify the origin of 33 different cancers with an accuracy of 93%.
Note that the accuracy could be higher using an AE, but an AE uses latent features for classification and therefore cannot relate the origin of cancers to the actual features (lncRNAs). The proposed computational framework can be used as a diagnostic tool by physicians to identify the origin of cancers from lncRNA expression profiles. The discovered lncRNAs can be studied further by biologists or drug designers to identify possible targets for cancer therapy.
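
The consensus rule described above (keep a lncRNA if at least two of the four techniques select it) can be illustrated with a minimal sketch. This is not the study's code: the function name `consensus_features`, the `min_votes` parameter, and the lncRNA identifiers are all hypothetical placeholders, and the per-method selections are invented for illustration only.

```python
from collections import Counter


def consensus_features(selections, min_votes=2):
    """Return the features chosen by at least `min_votes` selectors, sorted by name.

    `selections` maps a method name to the list of features that method picked.
    """
    votes = Counter()
    for features in selections.values():
        # Deduplicate so each method casts at most one vote per feature.
        votes.update(set(features))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical selections: these identifiers are placeholders,
# not lncRNAs reported by the study.
selections = {
    "CAE": ["lnc_A", "lnc_B", "lnc_C"],
    "LASSO": ["lnc_B", "lnc_D"],
    "RF": ["lnc_C", "lnc_E"],
    "SVM-RFE": ["lnc_B", "lnc_F"],
}

key_lncrnas = consensus_features(selections)
print(key_lncrnas)  # ['lnc_B', 'lnc_C']
```

Requiring agreement between methods trades recall for stability: a feature picked by only one technique is dropped, which is presumably why the final list of 69 lncRNAs is much smaller than the largest per-method selection of 500.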
