A powerful framework for an integrative study with heterogeneous omics data: from univariate statistics to multi-block analysis

High-throughput data generated by new biotechnologies require dedicated statistical treatment to be used efficiently in biological studies. In this article, we propose a powerful framework for managing and analysing heterogeneous multi-omics data in order to carry out an integrative analysis. We illustrate it with the mixOmics package for R, which specifically addresses data integration issues. Our work also aims to apply the most recent functionalities of mixOmics to real datasets. Although multi-block integrative methodologies exist, we hope to encourage biologists to use such approaches more widely within an operational framework. We used natural populations of the model plant Arabidopsis thaliana in this work, but the proposed framework is not limited to this plant and can be deployed for any organism of interest and any biological question. Four omics datasets (phenomics, metabolomics, cell wall proteomics and transcriptomics) were collected, analysed and integrated to study the cell wall plasticity of plants exposed to sub-optimal temperature growth conditions. The methodologies presented here progress from basic univariate statistics to multi-block integration analysis. We also highlight that each method, whether unsupervised or supervised, is associated with one biological issue. Using this framework enabled us to reach novel conclusions on the biological system that would not have been possible using standard statistical approaches.
