Arkas: Rapid, Reproducible RNAseq Analysis as a Service

The recently introduced Kallisto[1] pseudoaligner has radically simplified the quantification of transcripts in RNA-sequencing experiments. However, as with all computational advances, reproducibility across experiments requires attention to detail. The elegant approach of Kallisto reduces dependencies, but we noted differences in quantification between versions of Kallisto, and both upstream preparation and downstream interpretation benefit from an environment that enforces equivalent processing when comparing groups of samples. We therefore created the Arkas[3] and TxDbLite[4] R packages to meet these needs and to ease cloud-scale deployment. TxDbLite extracts structured information directly from source FASTA files with per-contig metadata, while Arkas enforces versioning of the derived indices and annotations, ensuring tight coupling of inputs and outputs while minimizing external dependencies. The two packages are combined in Illumina's BaseSpace cloud computing environment to offer a massively parallel and distributed quantification step for power users, loosely coupled to biologically informative downstream analyses via gene set analysis (with special focus on Reactome annotations for ENSEMBL transcriptomes). Previous work (e.g. Soneson et al., 2016[34]) has revealed that filtering transcriptomes to exclude lowly-expressed isoforms can improve statistical power, while more-complete transcriptome assemblies improve sensitivity in detecting differential transcript usage. Based on earlier work by Bourgon et al., 2010[11], we included this type of filtering for both gene- and transcript-level analyses within Arkas. For reproducible and versioned downstream analysis of results, we focused our efforts on ENSEMBL and Reactome[2] integration within the qusage[19] framework, adapted to take advantage of the parallel and distributed environment in Illumina's BaseSpace cloud platform.
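As a rough illustration of the filtering idea mentioned above, the following Python sketch drops features whose overall mean count falls below a threshold. It is not the Arkas implementation (Arkas is an R package); the function name `independent_filter`, the threshold `min_mean`, and the toy counts are all hypothetical. The point it tries to show is that the filter statistic is computed across all samples without reference to group labels, so it does not depend on the between-group comparison made afterwards.

```python
from statistics import mean


def independent_filter(counts, min_mean=1.0):
    """Keep features whose overall mean count is at least min_mean.

    counts maps a transcript or gene ID to a list of per-sample counts.
    The mean is taken over all samples, ignoring group membership
    (hypothetical threshold; choose it for your own data).
    """
    return {fid: vals for fid, vals in counts.items() if mean(vals) >= min_mean}


# Hypothetical toy data: three transcripts measured in four samples.
toy_counts = {
    "tx_A": [120, 98, 143, 110],
    "tx_B": [0, 1, 0, 0],
    "tx_C": [5, 7, 3, 9],
}

kept = independent_filter(toy_counts, min_mean=1.0)
print(sorted(kept))  # tx_B (mean 0.25) is removed
```

The same function could be applied to gene-level counts by keying the dictionary on gene IDs instead of transcript IDs.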
We show that quantification and interpretation of repetitive sequence element transcription is eased in both basic and clinical studies by just-in-time annotation and visualization. The option to retain pseudoBAM output for structural variant detection and annotation, although not insignificant in its demand for computation and storage, provides a middle ground between de novo transcriptome assembly and routine quantification. It consumes a fraction of the resources used by popular fusion detection pipelines and provides options to quantify gene fusions with known breakpoints without reassembly. Finally, we describe common use cases where investigators are better served by cloud-based computing platforms such as BaseSpace due to inherent efficiencies of scale and enlightened common self-interest. Our experiences suggest a common reference point for methods development, evaluation, and experimental interpretation.
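The requirement for equivalent processing when comparing groups of samples can be pictured with a small check like the one below. This is a minimal sketch under stated assumptions, not how Arkas enforces versioning: the function `check_equivalent_processing` and the metadata keys `kallisto_version` and `index_fingerprint` are hypothetical names chosen for illustration.

```python
def check_equivalent_processing(runs, keys=("kallisto_version", "index_fingerprint")):
    """Return True if every sample shares the same value for each key.

    runs maps a sample name to a dict of processing metadata.
    Raises ValueError if a key is missing for any sample or if
    samples disagree, since such samples should not be compared.
    """
    for key in keys:
        values = {sample: meta.get(key) for sample, meta in runs.items()}
        distinct = set(values.values())
        if None in distinct:
            raise ValueError(f"missing {key} for at least one sample: {values}")
        if len(distinct) > 1:
            raise ValueError(f"samples differ in {key}: {values}")
    return True


# Hypothetical metadata records for three samples.
consistent_runs = {
    "sample1": {"kallisto_version": "0.43.0", "index_fingerprint": "abc123"},
    "sample2": {"kallisto_version": "0.43.0", "index_fingerprint": "abc123"},
}
mixed_runs = dict(consistent_runs)
mixed_runs["sample3"] = {"kallisto_version": "0.42.4", "index_fingerprint": "abc123"}

print(check_equivalent_processing(consistent_runs))
try:
    check_equivalent_processing(mixed_runs)
except ValueError as err:
    print("refusing to compare:", err)
```

Failing loudly before any comparison is made is one simple way to keep inputs and outputs coupled; a real deployment would read these values from the quantification output rather than from hand-written dictionaries.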

[1] Lincoln Stein, et al. Reactome: a database of reactions, pathways and biological processes, 2010, Nucleic Acids Res.

[2] J. Thakar, et al. Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations, 2013, Nucleic Acids Research.

[3] R. Gentleman, et al. Independent filtering increases detection power for high-throughput experiments, 2010, Proceedings of the National Academy of Sciences.

[4] L. Reid, et al. Proposed methods for testing and selecting the ERCC external RNA controls, 2005, BMC Genomics.

[5] Cole Trapnell, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions, 2013, Genome Biology.

[6] C. Tyler-Smith, et al. Ancient DNA and the rewriting of human history: be sparing with Occam’s razor, 2016, Genome Biology.

[7] Mark D. Robinson, et al. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, 2009, Bioinform.

[8] Harold Pimentel, et al. Arkas: A package that complements Kallisto for quick, informative *seq analysis, 2016.

[9] David P. Kreil, et al. Assessing technical performance in differential gene expression experiments with external spike-in RNA control ratio mixtures, 2014, Nature Communications.

[10] Kary A. C. S. Ocaña, et al. Parallel computing in genomic research: advances and applications, 2015, Advances and Applications in Bioinformatics and Chemistry: AABC.

[11] Robert Gentleman, et al. Software for Computing and Annotating Genomic Ranges, 2013, PLoS Comput. Biol.

[12] Anirban P. Mitra, et al. A Central Role for Long Non-Coding RNA in Cancer, 2011, Front. Gene.

[13] Yan Li, et al. Venom gland transcriptomes of two elapid snakes (Bungarus multicinctus and Naja atra) and evolution of toxin genes, 2011, BMC Genomics.

[14] C. Begley, et al. Drug development: Raise standards for preclinical cancer research, 2012, Nature.

[15] Pablo Tamayo, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, 2005, Proceedings of the National Academy of Sciences of the United States of America.

[16] M. King, et al. BRCA1 and BRCA2 and the genetics of breast and ovarian cancer, 2001, Human Molecular Genetics.

[17] Conrad Sanderson, et al. RcppArmadillo: Accelerating R with high-performance C++ linear algebra, 2014, Comput. Stat. Data Anal.

[18] Tieliu Shi, et al. Incorporating the human gene annotations in different databases significantly improved transcriptomic and genetic analyses, 2013, RNA.

[19] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[20] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome, 2001, Nature.

[21] BMC Bioinformatics, 2005.

[22] W. Fan, et al. Alu distribution and mutation types of cancer genes, 2011, BMC Genomics.

[23] Simon White, et al. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline, 2014, BMC Bioinformatics.

[24] K. Gerald van den Boogaart, et al. Analyzing Compositional Data with R, 2013.

[25] Mark D. Robinson, et al. Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage, 2016, Genome Biology.

[26] Priscilla S. Markwood, et al. The Long Tail: Why the Future of Business is Selling Less of More, 2006.

[27] Nicola J. Mulder, et al. From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems, 2011, Bioinform.

[28] Kathleen F. Kerr, et al. The External RNA Controls Consortium: a progress report, 2005, Nature Methods.

[29] Lior Pachter, et al. Near-optimal probabilistic RNA-seq quantification, 2016, Nature Biotechnology.

[30] Matthew E. Ritchie, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies, 2015, Nucleic Acids Research.