Latent Feature Decompositions for Integrative Analysis of Multi-Platform Genomic Data

Increased availability of multi-platform genomics data on matched samples has sparked research efforts to discover how diverse molecular features interact both within and between platforms. In addition, simultaneous measurements of genetic and epigenetic characteristics illuminate the roles their complex relationships play in disease progression and outcomes. However, integrative methods for diverse genomics data face two challenges: ultra-high dimensionality and complex interactions both within and between platforms. We propose a novel modeling framework for integrative analysis that decomposes the large number of platform-specific features into a smaller number of latent features. We then build a predictive model for clinical outcomes that accounts for both within- and between-platform interactions using Bayesian model averaging procedures. Principal components, partial least squares and non-negative matrix factorization, as well as sparse counterparts of each, are used to define the latent features, and the performance of these decompositions is compared on both real and simulated data. The latent feature interactions are shown to preserve interactions between the original features; they not only aid prediction but also allow explicit selection of outcome-related features. The methods are motivated by and applied to a glioblastoma multiforme data set from The Cancer Genome Atlas to predict patient survival times by integrating gene expression, microRNA, copy number and methylation data. For the glioblastoma data, we find a high concordance between our selected prognostic genes and genes with known associations with glioblastoma. In addition, our model discovers several relevant cross-platform interactions, such as copy number variation-associated gene dosing and epigenetic regulation through promoter methylation.
On simulated data, we show that our proposed method successfully incorporates interactions within and between genomic platforms to aid accurate prediction and variable selection. Our methods perform best when principal components are used to define the latent features.
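
The two-stage idea described above (reduce each platform to a few latent features, then model the outcome with within- and between-platform interactions of those features) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are simulated, the helper names `pca_scores` and `latent_design` are hypothetical, principal components stand in for the full family of decompositions, and an ordinary least-squares fit replaces the Bayesian model averaging step purely to keep the example short.

```python
import itertools
import numpy as np

def pca_scores(X, k):
    """Return the first k principal component scores of X (samples x features)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]                      # scores = U * singular values

def latent_design(platforms, k):
    """Stack k latent features per platform plus all pairwise products.

    The pairwise products cover both within-platform pairs (two components
    from the same platform) and between-platform pairs.
    """
    Z = np.hstack([pca_scores(X, k) for X in platforms])
    labels = [f"p{p}_pc{j}" for p in range(len(platforms)) for j in range(k)]
    pairs = list(itertools.combinations(range(Z.shape[1]), 2))
    inter = np.column_stack([Z[:, a] * Z[:, b] for a, b in pairs])
    labels = labels + [f"{labels[a]}*{labels[b]}" for a, b in pairs]
    return np.hstack([Z, inter]), labels

# Simulated stand-ins for two platforms measured on the same 60 samples.
rng = np.random.default_rng(0)
n = 60
expr = rng.normal(size=(n, 200))   # e.g. gene expression (assumed shape)
meth = rng.normal(size=(n, 150))   # e.g. methylation (assumed shape)
y = rng.normal(size=n)             # placeholder outcome, not survival data

D, labels = latent_design([expr, meth], k=2)

# Least-squares fit with an intercept; a stand-in for Bayesian model averaging.
A = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With two platforms and two components each, the design matrix has four main-effect columns and six interaction columns, so a model search over these ten terms is far smaller than one over the original features. Replacing `pca_scores` with a partial least squares or non-negative matrix factorization routine would give the other decompositions the abstract compares.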
