Molecular heterogeneity at the network level: high-dimensional testing, clustering and a TCGA case study

Motivation: Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging. Furthermore, since network differences could provide important and biologically interpretable information to identify molecular subgroups, there is a need to consider the unsupervised task of learning subgroups and the networks that define them. This is a nontrivial clustering problem, with neither subgroups nor subgroup-specific networks known at the outset.

Results: We leverage recent ideas from high-dimensional statistics for testing and clustering in the network biology setting. The methods we describe can be applied directly to most continuous molecular measurements, and networks do not need to be specified beforehand. We illustrate the ideas and methods in a case study using protein data from The Cancer Genome Atlas (TCGA). This provides evidence that patterns of interplay between signalling proteins differ significantly between cancer types. Furthermore, we show how the proposed approaches can be used to learn subtypes and the molecular networks that define them.

Availability and implementation: The methods are available as the Bioconductor package nethet.

Contact: staedler.n@gmail.com or sach.mukherjee@dzne.de

Supplementary information: Supplementary data are available at Bioinformatics online.
