GLAD: a mixed-membership model for heterogeneous tumor subtype classification

MOTIVATION: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity both between and within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination.

RESULTS: We have developed a mixed-membership classification model, called GLAD, that simultaneously learns a sparse biomarker signature for each subtype and a distribution over subtypes for each sample. We demonstrate the accuracy of this model on simulated data, in vitro mixture experiments, and clinical samples from The Cancer Genome Atlas (TCGA) project. We show that many TCGA samples are likely a mixture of multiple subtypes.

AVAILABILITY: A Python module implementing our algorithm is available from http://genomics.wpi.edu/glad/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
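
To make the idea of a per-sample distribution over subtypes concrete, the following is a minimal sketch and not the GLAD algorithm itself. It assumes the sparse subtype signatures are already known and fixed, whereas the model described above learns them jointly with the memberships. It then recovers the mixing proportions of one synthetic sample by least squares constrained to the probability simplex. The names `project_to_simplex`, `estimate_memberships`, `signatures`, and `true_theta` are hypothetical and chosen only for illustration, and the data are simulated.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def estimate_memberships(x, signatures, n_iter=2000):
    """Fit x ~ signatures @ theta with theta >= 0 and sum(theta) == 1.

    Uses projected gradient descent; this is an illustrative stand-in,
    not the inference procedure used by GLAD.
    """
    k = signatures.shape[1]
    theta = np.full(k, 1.0 / k)               # start from a uniform mixture
    # Step size 1/L, where L is the largest eigenvalue of S^T S.
    step = 1.0 / np.linalg.norm(signatures, 2) ** 2
    for _ in range(n_iter):
        grad = signatures.T @ (signatures @ theta - x)
        theta = project_to_simplex(theta - step * grad)
    return theta

# Synthetic example: 50 biomarkers, 3 subtypes (assumed, simulated values).
rng = np.random.default_rng(0)
signatures = rng.normal(size=(50, 3))
signatures[np.abs(signatures) < 1.0] = 0.0    # zero out small entries: sparse signatures
true_theta = np.array([0.6, 0.3, 0.1])        # sample is a mixture of all three subtypes
x = signatures @ true_theta                   # noise-free mixed sample

theta_hat = estimate_memberships(x, signatures)
print(np.round(theta_hat, 3))
```

An all-or-none classifier would assign this sample entirely to the first subtype. The mixed-membership view instead reports the estimated proportion of each subtype, which is the kind of output the abstract describes for clinical samples.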
