Survival-supervised latent Dirichlet allocation models for genomic analysis of time-to-event outcomes

Two challenging problems in the clinical study of cancer are the characterization of cancer subtypes and the classification of individual patients according to those subtypes. Statistical approaches to these problems are hampered by population heterogeneity and by the difficulty of integrating data across diverse, high-dimensional covariates. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these concerns. LDA models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by treating each patient as a "document" whose "text" is constructed from clinical and high-dimensional genomic measurements. We then further extend the framework to allow supervision by a time-to-event response. The model enables efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas (TCGA) ovarian project identifies informative patient subgroups characterized by different propensities for exhibiting abnormal mRNA expression and methylation, corresponding to differential rates of survival from primary therapy.
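The patient-as-document idea can be illustrated with a minimal sketch. The abstract does not specify how measurements are turned into "words", so the function name `patient_to_document`, the token naming scheme, the cutoffs, and the example gene labels below are all hypothetical assumptions, not the authors' encoding. The sketch only shows one plausible way to build a bag of words from clinical values and abnormal genomic measurements, paired with a time-to-event response.

```python
from collections import Counter


def patient_to_document(clinical, expression, methylation,
                        z_cutoff=2.0, beta_cutoff=0.8):
    """Build a bag-of-words 'document' for one patient.

    Assumptions (not from the source): expression values are z-scores,
    methylation values are beta values in [0, 1], and a feature becomes
    a 'word' only when it crosses a hypothetical abnormality cutoff.
    """
    words = []
    # Each clinical covariate contributes one categorical token.
    for name, value in clinical.items():
        words.append(f"{name}={value}")
    # Abnormally high or low mRNA expression contributes a token.
    for gene, z in expression.items():
        if z >= z_cutoff:
            words.append(f"{gene}_expr_high")
        elif z <= -z_cutoff:
            words.append(f"{gene}_expr_low")
    # Abnormally high methylation contributes a token.
    for gene, beta in methylation.items():
        if beta >= beta_cutoff:
            words.append(f"{gene}_hypermethylated")
    return Counter(words)


# Toy patient with made-up values and placeholder gene labels.
clinical = {"stage": "III", "grade": "G3"}
expression = {"GENE_A": 2.7, "GENE_B": -0.3, "GENE_C": -2.4}
methylation = {"GENE_A": 0.12, "GENE_D": 0.91}

doc = patient_to_document(clinical, expression, methylation)

# The supervising response is a time-to-event pair:
# (follow-up time in months, 1 if the event was observed, 0 if censored).
response = (31.5, 1)

print(sorted(doc))
```

A collection of such word counts, one per patient, would form the corpus for an LDA-style model, with each `response` pair supplying the survival supervision.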
