Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data

MOTIVATION: Cancer subtype classification has the potential to significantly improve disease prognosis and to support individualized patient management. Existing methods are limited in their ability to handle extremely high-dimensional data and are susceptible to misleading, irrelevant factors, which results in ambiguous and overlapping subtypes.

RESULTS: To address these issues, we propose a novel approach that disentangles and eliminates irrelevant factors by leveraging the power of deep learning. Specifically, we designed a deep learning framework, referred to as DeepType, that jointly performs supervised classification, unsupervised clustering and dimensionality reduction to learn a cancer-relevant data representation with cluster structure. We applied DeepType to the METABRIC breast cancer dataset and compared its performance to state-of-the-art methods. DeepType significantly outperformed the existing methods, identifying more robust subtypes while using fewer genes. The new approach provides a framework for deriving more accurate and robust molecular cancer subtypes from increasingly complex, multi-source data.

AVAILABILITY: An open-source software package for the proposed method is freely available at www.acsu.buffalo.edu/~yijunsun/lab/DeepType.html.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
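To make the idea of a joint objective concrete, the following is a minimal sketch of how a classification term, a clustering term and a sparsity term could be combined into one loss. It is an illustration under stated assumptions, not the DeepType implementation: the function names, the choice of a k-means-style penalty on the hidden representation, the row-wise ℓ2,1 penalty on the first-layer weights, and the trade-off parameters `alpha` and `beta` are all hypothetical and are not taken from the abstract.

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy (the supervised classification term)."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


def kmeans_penalty(hidden, centroids, assignments):
    """Mean squared distance of each hidden vector to its assigned centroid
    (an assumed stand-in for the unsupervised clustering term)."""
    total = 0.0
    for h, k in zip(hidden, assignments):
        total += sum((a - b) ** 2 for a, b in zip(h, centroids[k]))
    return total / len(hidden)


def l21_norm(weights):
    """Sum of row-wise L2 norms. Assumed layout: one row per input gene, so
    driving a whole row to zero removes that gene from the model."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in weights)


def joint_loss(logits, labels, hidden, centroids, assignments, weights,
               alpha=0.1, beta=0.01):
    """Weighted sum of the three terms; alpha and beta are hypothetical
    trade-off parameters."""
    return (cross_entropy(logits, labels)
            + alpha * kmeans_penalty(hidden, centroids, assignments)
            + beta * l21_norm(weights))
```

In a sketch like this, the classification term keeps the representation relevant to known labels, the clustering term encourages compact groups in the hidden space, and the sparsity term pushes the model to rely on fewer input genes. In practice the logits and hidden vectors would come from a neural network trained by gradient descent, which is omitted here.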
