A primer on correlation-based dimension reduction methods for multi-omics analysis

Continuing advances in omic technologies have made it increasingly feasible to measure the many features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we first cover correlation-based dimension reduction approaches for single omic datasets, then methods for pairs of omic datasets, and finally techniques for three or more omic datasets. We also briefly describe network methods that apply when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, each method is linked to relevant R packages that implement it. Finally, we discuss experimental design scenarios and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics.
