DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays

Motivation: In the continuously expanding omics era, novel computational and statistical strategies are needed for data integration and for the identification of biomarkers and molecular signatures. We present DIABLO, a multi-omics integrative method that seeks common information across different data types by selecting a subset of molecular features, while discriminating between multiple phenotypic groups.

Results: Using simulations and benchmark multi-omics studies, we show that DIABLO identifies features with superior biological relevance compared to existing unsupervised integrative methods, while achieving predictive performance comparable to state-of-the-art supervised approaches. DIABLO is versatile, allowing for modular-based analyses and cross-over study designs. In two case studies, DIABLO identified both known and novel multi-omics biomarkers consisting of mRNAs, miRNAs, CpGs, proteins and metabolites.

Availability: DIABLO is implemented in the mixOmics R Bioconductor package, with functions for parameter choice and visualisation to assist in the interpretation of the integrative analyses, along with tutorials on http://mixomics.org and in our Bioconductor vignette.

Supplementary information: Supplementary information is available at Bioinformatics online.
