Nonlinear joint latent variable models and integrative tumor subtype discovery

Integrative analysis identifies clusters by combining data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes, to discover novel tumor subtypes. Most existing integrative analysis methods are based on joint latent variable models, which fall into two broad classes: joint factor analysis, which parameterizes the latent variables as continuous, and joint mixture modeling, which parameterizes them as discrete. Despite recent progress, many issues remain. In particular, existing integration methods based on joint factor analysis may model multiple clusters poorly because the assumed Gaussian distribution is unimodal, while those based on joint mixture modeling may lack the capacity for dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model that is flexible enough to account for multiple clusters while also performing dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), that performs dimension reduction simultaneously across multiple types of data while achieving feature selection separately for each data type. Simulations examine the operating characteristics of the methods; in them, the proposed method compares favorably with the popular iCluster, which is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is analyzed.
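
For orientation, the following is a hedged sketch of the kind of model the abstract describes, not the paper's actual formulation. The first equation is the standard generative topographic mapping density with a latent grid $x_1,\dots,x_K$, fixed basis functions $\phi$, weight matrix $W$ and noise precision $\beta$. The joint version and the penalized objective that follow are assumptions made here for illustration: the shared grid across data types $m = 1,\dots,M$, the type-specific $W_m$ and $\beta_m$, and the lasso-type penalty with tuning parameters $\lambda_m$ are one plausible reading of "integrative and regularized", and the paper's own choices may differ.

```latex
% Standard GTM: a constrained Gaussian mixture whose K centers lie on a
% nonlinear low-dimensional manifold y(x; W) = W \phi(x) in data space R^D.
p(t \mid W, \beta)
  = \frac{1}{K} \sum_{k=1}^{K}
    \left( \frac{\beta}{2\pi} \right)^{D/2}
    \exp\!\left\{ -\frac{\beta}{2} \, \lVert W \phi(x_k) - t \rVert^{2} \right\}

% ASSUMED joint extension: M data types share the latent grid x_k,
% each with its own weights W_m and precision \beta_m (dimension D_m).
p\big(t^{(1)}, \dots, t^{(M)}\big)
  = \frac{1}{K} \sum_{k=1}^{K} \prod_{m=1}^{M}
    \mathcal{N}\!\big( t^{(m)} \,\big|\, W_m \phi(x_k),\; \beta_m^{-1} I_{D_m} \big)

% ASSUMED penalized objective over N samples: a separate lasso-type
% penalty per data type, so features are selected type by type.
\max_{\{W_m, \beta_m\}} \;
  \sum_{n=1}^{N} \log p\big(t_n^{(1)}, \dots, t_n^{(M)}\big)
  \; - \; \sum_{m=1}^{M} \lambda_m \, \lVert W_m \rVert_{1}
```

Under this sketch, the mixture over grid points is what lets the model represent several clusters, unlike a single Gaussian latent factor, and the low-dimensional latent grid shared by all data types is what provides the dimension reduction.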
