Factor analysis for survival time prediction with informative censoring and diverse covariates

Fulfilling the promise of precision medicine requires accurately and precisely classifying disease states. For cancer, this includes predicting survival time from a surfeit of covariates. Such data present an opportunity for improved prediction but also a challenge due to their high dimensionality. Furthermore, disease populations can be heterogeneous. Integrative modeling is sensible here, since its underlying hypothesis is that joint analysis of multiple covariates provides greater explanatory power than separate analyses. We propose an integrative latent variable model that combines factor analysis for various data types with an exponential proportional hazards (EPH) model for continuous survival time with informative censoring. The factor and EPH models are connected through low-dimensional latent variables, which can be interpreted and visualized to identify subpopulations. We use this model to predict survival time and demonstrate its utility in simulation and on four Cancer Genome Atlas datasets: diffuse lower-grade glioma, glioblastoma multiforme, lung adenocarcinoma, and lung squamous cell carcinoma. These datasets have small sample sizes, high-dimensional diverse covariates, and high censorship rates. We compare our model's predictions with those of three alternative models. Our model outperforms the alternatives in simulation and is competitive on the real datasets. Furthermore, the low-dimensional visualization for diffuse lower-grade glioma displays known subpopulations.
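
The abstract gives no equations for the model, so the following is only an illustrative sketch of the kind of generative structure it describes, not the authors' specification. It assumes standard normal latent variables, purely Gaussian covariates (the abstract mentions various data types, which are not modeled here), and informative censoring represented by letting the censoring hazard depend on the same latent variables as the event hazard. The function name `simulate_latent_eph` and every parameter and default value are hypothetical.

```python
import numpy as np


def simulate_latent_eph(n=200, p=50, k=2, seed=0):
    """Simulate covariates and censored survival times that share latent factors.

    All distributional choices below are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)

    # Low-dimensional latent variables, one row per subject.
    z = rng.standard_normal((n, k))

    # Factor model: high-dimensional covariates are noisy linear maps of z.
    loadings = rng.standard_normal((p, k))
    x = z @ loadings.T + 0.5 * rng.standard_normal((n, p))

    # Exponential proportional hazards: constant baseline hazard scaled by
    # exp(z @ beta), so each event time is exponential with that rate.
    beta = rng.standard_normal(k)
    event_rate = np.exp(-1.0 + z @ beta)
    t_event = rng.exponential(1.0 / event_rate)

    # Informative censoring (assumed form): the censoring hazard also
    # depends on z, so censoring times carry information about the latents.
    gamma = rng.standard_normal(k)
    censor_rate = np.exp(-1.0 + z @ gamma)
    t_censor = rng.exponential(1.0 / censor_rate)

    # Only the earlier of the two times is observed, with an event flag.
    time = np.minimum(t_event, t_censor)
    event = t_event <= t_censor
    return x, time, event, z


if __name__ == "__main__":
    x, time, event, z = simulate_latent_eph()
    print("covariate matrix shape:", x.shape)
    print("fraction censored:", 1.0 - event.mean())
```

Because the covariates and both hazards are driven by the same `z`, a low-dimensional estimate of `z` recovered from the covariates would be informative about survival time, which is the connection between the factor and EPH components that the abstract describes.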
