Cancer classification of single-cell gene expression data by neural network

MOTIVATION Cancer classification based on gene expression profiles has provided insight into the causes of cancer and into cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq).

RESULTS We designed cancer classifiers that can identify 21 types of cancer and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7,398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA, based on the 300 most significant genes expressed in each cancer. We then compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN), and random forest (RF) methods. The neural network performed consistently better than the other methods. We further applied our approach to scRNA-seq data transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples.

AVAILABILITY Cancer classification by neural network.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
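
To make the smoothing idea concrete, the sketch below shows a simplified, single-pass nearest-neighbour aggregation of scRNA-seq counts. It is an illustration only and not the published kNN-smoothing algorithm or the authors' pipeline: the function name `knn_smooth`, the scaling to 10,000 counts per cell, the log transform, and the Euclidean distance are all assumptions chosen for brevity.

```python
import math


def knn_smooth(counts, k):
    """Add to each cell the raw counts of its k most similar cells.

    counts: list of cells, each a list of non-negative gene counts
            (all cells must list the same genes in the same order).
    k:      number of neighbours to aggregate (hypothetical parameter).
    """
    n = len(counts)
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < number of cells")

    # Assumption: compare cells on depth-normalised, log-transformed
    # profiles so that distance reflects expression shape, not depth.
    def profile(cell):
        total = sum(cell) or 1
        return [math.log1p(1e4 * c / total) for c in cell]

    profiles = [profile(cell) for cell in counts]

    smoothed = []
    for i in range(n):
        # Rank every other cell by Euclidean distance to cell i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(profiles[i], profiles[j]),
        )
        neighbours = [i] + order[:k]
        # Sum the raw counts gene by gene over the cell and its neighbours.
        smoothed.append(
            [sum(counts[j][g] for j in neighbours) for g in range(len(counts[i]))]
        )
    return smoothed


# Toy data: two cells dominated by gene 0 and two dominated by gene 1.
toy_counts = [[5, 0, 1], [4, 0, 2], [0, 6, 1], [0, 5, 0]]
print(knn_smooth(toy_counts, k=1))
# → [[9, 0, 3], [9, 0, 3], [0, 11, 1], [0, 11, 1]]
```

Aggregating similar cells reduces the sparsity of single-cell profiles, which is one plausible reason a classifier trained on bulk RNA-seq could be applied to smoothed scRNA-seq data.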
