A Machine Learning Approach to Prostate Cancer Risk Classification Through Use of RNA Sequencing Data

Advances in RNA sequencing technology have made the resulting genomic data more precise, which makes fitting models to sequencing data more practical. Previous studies of prostate cancer diagnosis have been limited to microarray data and have had limited success. We used prostate cancer sequencing data from The Cancer Genome Atlas (TCGA) to test the viability of fitting machine learning models to RNA sequencing data. In this research, we addressed two complementary tasks. The first was to identify the genes most associated with potential cancer. We started by using the mutual information metric to identify the most significant genes, then applied the Recursive Feature Elimination (RFE) algorithm to reduce the number of genes needed to identify cancer. The second task was to build a classification model that separates potential high-risk patients from healthy ones. A major challenge with sequencing data is its high dimensionality, which we addressed for the second task with Principal Component Analysis (PCA). Another challenge is class imbalance: the data set has a 10:1 ratio of cancerous to healthy tissue samples. To address this, we used the Synthetic Minority Oversampling Technique (SMOTE) to create synthetic observations and equalize the class distribution. We trained and tested a logistic regression model using 5-fold cross-validation. The results were promising: the model significantly reduced the false negative rate compared with current diagnostic techniques while keeping the false positive rate low. It also showed substantial improvement over previous machine learning attempts to diagnose prostate cancer. Our model could be applied as part of the patient diagnosis pipeline to help improve accuracy.
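To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic observation is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbors. It is not the authors' implementation. The function names `smote` and `balance`, the choice of k = 5, and the two-feature toy data (standing in for PCA components) are assumptions made for illustration; only the 10:1 ratio comes from the abstract.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment
        # from the base sample to the chosen neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample the minority class until both classes (binary y) are equal."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    n_needed = max(n_majority - len(minority), 0)
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data only (not TCGA): 50 "cancerous" vs. 5 "healthy" samples, a 10:1 ratio.
toy_rng = random.Random(1)
X = [[toy_rng.gauss(2.0, 1.0), toy_rng.gauss(2.0, 1.0)] for _ in range(50)]
X += [[toy_rng.gauss(0.0, 1.0), toy_rng.gauss(0.0, 1.0)] for _ in range(5)]
y = [1] * 50 + [0] * 5

X_bal, y_bal = balance(X, y, minority_label=0)
print(y_bal.count(1), y_bal.count(0))
```

When oversampling is combined with cross-validation, a common precaution is to apply it only to the training portion of each fold, so that synthetic points derived from a test sample never reach the model during training. The abstract does not say how the two steps were combined in this study.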
