SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses

Background: RNA sequencing (RNA-seq) is a widely used biological assay, and the volume of data it generates keeps growing. In practice, a researcher must perform many individual steps before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, performing only one step of a larger workflow, such as alignment of reads to a reference genome. The demand for a more comprehensive and reproducible workflow has led to a number of publicly available RNA-seq pipelines. However, we have found that most of them require computational expertise to set up or to share among several users, are not actively maintained, or lack features that have proven important in our own analyses.

Results: In response to these concerns, we have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, Docker integration), and a separate configuration file is provided for each (http://research.libd.org/SPEAQeasy/).

Conclusions: SPEAQeasy is user-friendly and lowers the computational entry barrier to RNA-seq data processing for biologists and clinicians, because the main input file is simply a table of sample names and their corresponding FASTQ files. The goal is to provide a flexible pipeline that is immediately usable by researchers, regardless of their technical background or computing environment.
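To make the idea of a table of sample names and FASTQ files concrete, the following is a minimal Python sketch of how such a table could be assembled from a list of file names. It is illustrative only: the column names (`sample`, `fastq_r1`, `fastq_r2`), the `_R1`/`_R2` file-naming convention, and the helper functions are assumptions made for this example and are not taken from SPEAQeasy; the documentation at the URL above describes the actual input format.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def build_sample_table(fastq_names):
    """Group FASTQ file names by sample and return sorted (sample, R1, R2) rows.

    Single-end samples get an empty string in the R2 column.
    """
    mates = defaultdict(dict)
    for name in fastq_names:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unrecognized FASTQ name: {name}")
        mates[match["sample"]][match["mate"]] = name

    rows = []
    for sample in sorted(mates):
        pair = mates[sample]
        if "1" not in pair:
            raise ValueError(f"sample {sample} has no R1 file")
        rows.append((sample, pair["1"], pair.get("2", "")))
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "fastq_r1", "fastq_r2"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    files = ["B_R1.fastq.gz", "A_R2.fastq.gz", "A_R1.fastq.gz"]
    print(to_tsv(build_sample_table(files)), end="")
```

Under these assumptions, a paired-end sample occupies one row with both mate files, and a single-end sample leaves the last column empty. The point of the sketch is only that a plain table of this kind can be prepared without pipeline-specific tooling.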
