iTOP: inferring the topology of omics data

Motivation: In biology, we are often faced with multiple datasets recorded on the same set of objects, such as multi-omics and phenotypic data of the same tumors. These datasets are typically not independent from each other. For example, methylation may influence gene expression, which may, in turn, influence drug response. Such relationships can strongly affect analyses performed on the data, as we have previously shown for the identification of biomarkers of drug response. Therefore, it is important to be able to chart the relationships between datasets.

Results: We present iTOP, a methodology to infer a topology of relationships between datasets. We base this methodology on the RV coefficient, a measure of matrix correlation, which can be used to determine how much information is shared between two datasets. We extended the RV coefficient for partial matrix correlations, which allows the use of graph reconstruction algorithms, such as the PC algorithm, to infer the topologies. In addition, since multi-omics data often contain binary data (e.g. mutations), we also extended the RV coefficient for binary data. Applying iTOP to pharmacogenomics data, we found that gene expression acts as a mediator between most other datasets and drug response: only proteomics clearly shares information with drug response that is not present in gene expression. Based on this result, we used TANDEM, a method for drug response prediction, to identify which variables predictive of drug response were distinct to either gene expression or proteomics.

Availability and implementation: An implementation of our methodology is available in the R package iTOP on CRAN. Additionally, an R Markdown document with code to reproduce all figures is provided as Supplementary Material.

Supplementary information: Supplementary data are available at Bioinformatics online.
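To make the idea of matrix correlation concrete, the following is a minimal Python sketch of the classical RV coefficient between two datasets measured on the same objects (rows). It is not the iTOP implementation, which is an R package. The function names `config_matrix`, `rv_coefficient` and `partial_rv` are hypothetical. The partial variant is an assumption: it applies the textbook first-order partial correlation formula to RV values by analogy, and the extension described in the abstract may be defined differently. The extension for binary data is not covered.

```python
import numpy as np


def config_matrix(X):
    """Return the n x n configuration matrix of a column-centered dataset."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # center every variable (column)
    return Xc @ Xc.T


def rv_coefficient(X, Y):
    """Classical RV coefficient between two datasets with matching rows."""
    Sx, Sy = config_matrix(X), config_matrix(Y)
    # For symmetric matrices, trace(Sx @ Sy) equals the elementwise sum.
    num = np.sum(Sx * Sy)
    den = np.sqrt(np.sum(Sx * Sx) * np.sum(Sy * Sy))
    return num / den


def partial_rv(X, Y, Z):
    """ASSUMPTION: first-order partial correlation formula applied to RV values.

    This is an illustrative analogy, not confirmed to match iTOP's definition.
    It is undefined when RV(X, Z) or RV(Y, Z) equals 1.
    """
    rxy = rv_coefficient(X, Y)
    rxz = rv_coefficient(X, Z)
    ryz = rv_coefficient(Y, Z)
    return (rxy - rxz * ryz) / np.sqrt((1.0 - rxz**2) * (1.0 - ryz**2))


if __name__ == "__main__":
    # Simulated chain X <- Z -> Y: X and Y are related only through Z.
    rng = np.random.default_rng(0)
    n = 50
    Z = rng.normal(size=(n, 5))
    X = Z @ rng.normal(size=(5, 8)) + 0.5 * rng.normal(size=(n, 8))
    Y = Z @ rng.normal(size=(5, 6)) + 0.5 * rng.normal(size=(n, 6))
    print("RV(X, Y)     =", rv_coefficient(X, Y))
    print("RV(X, Y | Z) =", partial_rv(X, Y, Z))
```

In this simulated chain, the marginal RV between `X` and `Y` would be expected to be clearly larger than the value conditioned on `Z`. That contrast is the kind of signal a constraint-based graph reconstruction algorithm such as the PC algorithm uses to decide whether to keep or remove an edge between two datasets.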
