Assessing the impact of transcriptomics data analysis pipelines on downstream functional enrichment results

Transcriptomics, and in particular RNA-Seq, has become a widely used approach to assess the molecular state of biological systems. To facilitate its analysis, many tools have been developed for the different steps, such as filtering lowly expressed genes, normalisation, differential expression, and enrichment. While numerous studies have examined the impact of method choices on differential expression results, little attention has been paid to their effects on further downstream functional analysis using enrichment of gene sets, such as pathways, which typically provides the basis for interpretation and follow-up experiments. To address this gap, we introduce FLOP (FunctionaL Omics Processing), a comprehensive Nextflow-based workflow that combines various methods for preprocessing and downstream enrichment analysis, allowing users to perform end-to-end analyses of count-level transcriptomic data. We illustrate FLOP's capabilities on diverse datasets comprising samples from end-stage heart failure patients and cancer cell lines in both basal and drug-perturbed states. We found that the correlation between gene set enrichment analysis results can vary considerably across alternative pipelines. Additionally, we observed that omitting the filtering step had the largest impact on the correlation between pipelines in the gene set space, especially in settings with limited statistical power. Overall, our results underscore the importance of carefully evaluating the consequences of the choice of preprocessing methods on downstream enrichment analyses. We envision FLOP as a valuable tool to measure the robustness of functional analyses, ultimately leading to more reliable and conclusive biological findings.
