multiomics: A user-friendly multi-omics data harmonisation R pipeline

Data from multiple omics layers of a biological system are growing in quantity, heterogeneity and dimensionality. Simultaneous multi-omics data integration is a growing field of research, as it has strong potential to reveal previously hidden biological relationships, which could lead to early diagnosis, prognosis and expedited treatment. Many tools for multi-omics data integration are being developed. However, these tools are often restricted to highly specific experimental designs and types of omics data. While some general methods do exist, they require specific data formats and experimental conditions. A major limitation in the field is the lack of a single-omics or multi-omics pipeline that accepts data in an unrefined, information-rich form before integration and then generates output for further investigation. There is increasing demand for a generic multi-omics pipeline to facilitate general-purpose exploration and analysis of heterogeneous data. We therefore present our R multiomics pipeline, an easy-to-use and flexible pipeline that takes unrefined multi-omics data, sample information and user-specified parameters as input and generates a list of output plots and data tables for quality control and downstream analysis. We demonstrate the pipeline on two separate COVID-19 case studies. The pipeline supports limited checkpointing, in which intermediate output is staged so that a run can continue after errors or interruptions. It also generates a script for reproducing the analysis, which improves reproducibility. The pipeline integrates seamlessly with the mixOmics R package: the pipeline's R data object can be loaded and manipulated with mixOmics functions. Our pipeline can be installed as an R package or from the git repository, and is accompanied by detailed documentation with walkthroughs of two case studies. The pipeline is also available as Docker and Singularity containers.
