Comparison and evaluation of integrative methods for the analysis of multilevel omics data: a study based on simulated and experimental cancer data

Integrative analysis aims to identify the driving factors of a biological process by jointly exploring data from multiple cellular levels. The volume of omics data produced is constantly increasing, and so is the collection of tools for its analysis. Comparative studies that assess performance and the biological value of results, however, are rare despite being in great demand. We present a comprehensive comparison of three integrative analysis approaches, sparse canonical correlation analysis (sCCA), non-negative matrix factorization (NMF) and the logic data mining method MicroArray Logic Analyzer (MALA), by applying them to simulated and experimental omics data. We find that sCCA and NMF are able to identify differential features in simulated data, while MALA falls short. Applied to experimental data, MALA performs best in terms of sample classification accuracy, and the classification power of the prioritized feature sets is high in general (97.1-99.5% accuracy). The proportion of features also identified by at least one of the other methods, however, is approximately 60% for sCCA and NMF and nearly 30% for MALA, and the proportion of features jointly identified by all methods is only around 16%. Similarly, the congruence on functional levels (Gene Ontology, Reactome) is low. Furthermore, the agreement of the identified feature sets with curated gene signatures relevant to the investigated disease is modest. We discuss possible reasons for the moderate overlap of the identified feature sets with each other and with curated cancer signatures. The R code to create the simulated data, results and figures is provided at https://github.com/ThallingerLab/IamComparison.
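The two overlap measures reported above can be illustrated with a minimal Python sketch. It is not taken from the authors' R code. The function name `overlap_summary`, the toy feature identifiers, and the choice of the union of all feature sets as the denominator for the jointly identified proportion are assumptions made for illustration only.

```python
def overlap_summary(feature_sets):
    """Summarize agreement between feature sets selected by several methods.

    feature_sets: dict mapping a method name to a set of feature identifiers.
    Returns (shared_with_other, shared_by_all), where
      shared_with_other[m] is the fraction of method m's features that at
        least one other method also selected, and
      shared_by_all is the fraction of all selected features (union) that
        every method selected (assumed denominator, see lead-in).
    """
    shared_with_other = {}
    for name, features in feature_sets.items():
        # Pool the features of every method except the current one.
        others = set().union(
            *(s for other, s in feature_sets.items() if other != name)
        )
        shared_with_other[name] = (
            len(features & others) / len(features) if features else 0.0
        )

    union = set().union(*feature_sets.values())
    common = set.intersection(*feature_sets.values())
    shared_by_all = len(common) / len(union) if union else 0.0
    return shared_with_other, shared_by_all


# Hypothetical toy input, not data from the study.
toy_sets = {
    "sCCA": {"g1", "g2", "g3", "g4"},
    "NMF": {"g2", "g3", "g4", "g5"},
    "MALA": {"g4", "g6"},
}
per_method, joint = overlap_summary(toy_sets)
```

With this toy input, three of the four sCCA features are shared with another method, and only one of the six distinct features is selected by all three methods.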
