Increased comparability between RNA-Seq and microarray data by utilization of gene sets

The field of transcriptomics uses and measures mRNA as a proxy of gene expression. There are currently two major platforms in use for quantifying mRNA, microarray and RNA-Seq. Many comparative studies have shown that their results are not always consistent. In this study we aim to find a robust method to increase comparability of both platforms enabling data analysis of merged data from both platforms. We transformed high dimensional transcriptomics data from two different platforms into a lower dimensional, and biologically relevant dataset by calculating enrichment scores based on gene set collections for all samples. We compared the similarity between data from both platforms based on the raw data and on the enrichment scores. We show that the performed data transforms the data in a biologically relevant way and filters out noise which leads to increased platform concordance. We validate the procedure using predictive models built with microarray based enrichment scores to predict subtypes of breast cancer using enrichment scores based on sequenced data. Although microarray and RNA-Seq expression levels might appear different, transforming them into biologically relevant gene set enrichment scores significantly increases their correlation, which is a step forward in data integration of the two platforms. The gene set collections were shown to contain biologically relevant gene sets. More in-depth investigation on the effect of the composition, size, and number of gene sets that are used for the transformation is suggested for future research.

[1]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[2]  E. Blomme,et al.  Comparison of RNA-Seq and Microarray Gene Expression Platforms for the Toxicogenomic Evaluation of Liver From Short-Term Rat Toxicity Studies , 2019, Front. Genet..

[3]  Age K. Smilde,et al.  Real-life metabolomics data analysis : how to deal with complex data ? , 2010 .

[4]  J. Mesirov,et al.  The Molecular Signatures Database Hallmark Gene Set Collection , 2015 .

[5]  Jie Tan,et al.  Cross-platform normalization of microarray and RNA-seq data for machine learning applications , 2016, PeerJ.

[6]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[7]  P. Khaitovich,et al.  BMC Genomics BioMed Central Methodology article Estimating accuracy of RNA-Seq and microarrays with proteomics , 2022 .

[8]  Justin Guinney,et al.  GSVA: gene set variation analysis for microarray and RNA-Seq data , 2013, BMC Bioinformatics.

[9]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[10]  Bart De Moor,et al.  BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis , 2005, Bioinform..

[11]  Wonhee Jang,et al.  Meta-analysis of microarray and RNA-Seq gene expression datasets for carcinogenic risk: An assessment of Bisphenol A , 2017, Molecular & Cellular Toxicology.

[12]  E. Birney,et al.  Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt , 2009, Nature Protocols.

[13]  Marjan S. Bolouri,et al.  Integrated Analysis of Protein Composition, Tissue Diversity, and Gene Regulation in Mouse Mitochondria , 2003, Cell.

[14]  Adam A. Margolin,et al.  The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity , 2012, Nature.

[15]  J. Gower Generalized procrustes analysis , 1975 .

[16]  A. Bittner,et al.  Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells , 2014, PloS one.

[17]  Ben S. Wittner,et al.  Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1 , 2009, Nature.