Rail-RNA: Scalable analysis of RNA-seq splicing and coverage

Motivation: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, nothing is gained by analyzing multiple samples together, and extra work is needed to obtain analysis products that incorporate data from across samples.

Results: We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. When many samples are analyzed, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. In addition to alignments in SAM/BAM format, Rail-RNA outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across the samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in the samples in which a given feature is found. These supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables.

Availability and Implementation: Rail-RNA is open-source software available at http://rail.bio.

Contacts: anellore@gmail.com or langmea@cs.jhu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
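To illustrate the cross-sample coverage summaries listed under output (ii) in the Results, the following is a minimal Python sketch of how normalized mean and median coverage at each base might be computed from per-sample coverage vectors. The abstract does not say how Rail-RNA normalizes coverage, so the scaling used here (each sample rescaled to a common total coverage) is an assumption, and the names `normalize`, `summarize` and `target_total` are hypothetical; none of this is Rail-RNA's own code or API.

```python
from statistics import mean, median


def normalize(coverage, target_total):
    """Scale one sample's per-base coverage so it sums to target_total.

    Assumption: simple total-coverage scaling; the real tool may differ.
    """
    total = sum(coverage)
    if total == 0:
        return [0.0] * len(coverage)
    factor = target_total / total
    return [c * factor for c in coverage]


def summarize(samples, target_total=100.0):
    """Return (means, medians): per-base summaries across all samples.

    `samples` maps a sample name to a list of per-base coverage values;
    every list must cover the same bases in the same order.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    lengths = {len(cov) for cov in samples.values()}
    if len(lengths) != 1:
        raise ValueError("all samples must span the same number of bases")
    normalized = [normalize(cov, target_total) for cov in samples.values()]
    # zip(*...) regroups values so each tuple holds one base across samples.
    per_base = list(zip(*normalized))
    means = [mean(column) for column in per_base]
    medians = [median(column) for column in per_base]
    return means, medians


if __name__ == "__main__":
    toy = {"a": [1, 1, 2], "b": [2, 2, 4], "c": [0, 4, 0]}
    means, medians = summarize(toy, target_total=4.0)
    print(means, medians)
```

In this toy input, samples `a` and `b` have the same coverage shape at different sequencing depths, so they become identical after scaling. The median then follows that shared shape, while the mean is pulled toward the outlying sample `c`.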
