Background
Multiple omics datasets are now routinely measured on the same samples, in the belief that they represent different aspects of the underlying biological system. Integrating these datasets facilitates the understanding of that system. Various methods have been proposed for this purpose, such as Partial Least Squares (PLS), which decomposes two datasets into joint and residual subspaces. Because omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces, and thereby estimates the joint subspaces better. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, whereas it is often of interest to identify a small subset of variables relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS), which utilizes biological information on group structures among variables and performs group selection in the joint subspace.

Results
The simulation study showed that introducing sparsity improved feature selection performance. Furthermore, incorporating group structures increased the robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The target genes of the selected methylation groups turned out to be relevant to the immune system, in which IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between the regulomics and transcriptomics data. The genes corresponding to the selected features appeared to be relevant to heart muscle disease.

Conclusions
GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, yielding a small subset of features that best explain the relationship between the two omics datasets and improving interpretability.
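To illustrate what group selection in a joint subspace can look like, the sketch below applies group-wise soft-thresholding to a single loading vector. This is the standard group-lasso shrinkage operator and is offered only as an assumption about the kind of penalty involved; the abstract does not specify the exact form used in GO2PLS. The function name `group_soft_threshold`, the penalty `lam`, and the toy weights and group labels are all hypothetical.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Shrink each group of weights by its Euclidean norm (group-lasso style).

    weights: list of floats (a hypothetical joint loading vector)
    groups:  list of group labels, one per weight (e.g. gene names)
    lam:     non-negative penalty (hypothetical tuning parameter)
    """
    # Accumulate the squared Euclidean norm of each group.
    sq_norms = {}
    for w, g in zip(weights, groups):
        sq_norms[g] = sq_norms.get(g, 0.0) + w * w

    # Per-group scale factor: max(0, 1 - lam / ||w_g||).
    # Groups whose norm is at most lam are set to zero as a whole.
    scale = {}
    for g, sq in sq_norms.items():
        norm = math.sqrt(sq)
        scale[g] = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0

    shrunk = [w * scale[g] for w, g in zip(weights, groups)]

    # Rescale to unit length, unless every group was removed.
    total = math.sqrt(sum(w * w for w in shrunk))
    return [w / total for w in shrunk] if total > 0 else shrunk


# Toy example: five variables (e.g. CpG sites) in three hypothetical groups.
weights = [0.6, 0.8, 0.05, -0.05, 0.3]
groups = ["geneA", "geneA", "geneB", "geneB", "geneC"]

sparse = group_soft_threshold(weights, groups, lam=0.2)
selected = sorted({g for w, g in zip(sparse, groups) if w != 0.0})
print(selected)
```

In this toy input the weakly loaded group is dropped as a unit while the other groups are kept and shrunk, which is the behaviour that makes the selected features interpretable at the group level rather than variable by variable.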