VARUS: sampling complementary RNA reads from the sequence read archive

BackgroundVast amounts of next generation sequencing RNA data has been deposited in archives, accompanying very diverse original studies. The data is readily available also for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of available experiments, sequencing runs and reads for this purpose is a nontrivial task and complicated by the inhomogeneity of the data.ResultsThis article presents the software VARUS that selects, downloads and aligns reads from NCBI’s Sequence Read Archive, given only the species’ binomial name and genome. VARUS automatically chooses runs from among all archived runs to randomly select subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species VARUS achieved both a higher sensitivity and specificity with a lower number of downloaded reads than when runs were manually selected. At the example of twelve eukaryotic genomes, we show that RNA-Seq that was sampled with VARUS is well-suited for fully-automatic genome annotation with BRAKER.ConclusionsWith VARUS, genome annotation can be automatized to the extent that not even the selection and quality control of RNA-Seq has to be done manually. This introduces the possibility to have fully automatized genome annotation loops over potentially many species without incurring a loss of accuracy over a manually supervised annotation process.

[1]  Juliana Costa-Silva,et al.  RNA-Seq differential expression analysis: An extended review and a software tool , 2017, PloS one.

[2]  S. Salzberg,et al.  StringTie enables improved reconstruction of a transcriptome from RNA-seq reads , 2015, Nature Biotechnology.

[3]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[4]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[5]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[6]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[7]  Katharina J. Hoff,et al.  BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS , 2016, Bioinform..

[8]  Rolf Backofen,et al.  CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains , 2014, Nucleic Acids Res..

[9]  David Haussler,et al.  Using native and syntenically mapped cDNA alignments to improve de novo gene finding , 2008, Bioinform..

[10]  Gordon Gremme,et al.  GenomeTools: A Comprehensive Software Library for Efficient Processing of Structured Genome Annotations , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[11]  Michael R. Brent,et al.  Eval: A software package for analysis of genome annotations , 2003, BMC Bioinformatics.

[12]  Mario Stanke,et al.  Predicting Genes in Single Genomes with AUGUSTUS , 2018, Current protocols in bioinformatics.

[13]  Takeru Nakazato,et al.  Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive , 2017, GigaScience.

[14]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[15]  Ying Cheng,et al.  The European Nucleotide Archive , 2010, Nucleic Acids Res..

[16]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[17]  M. Borodovsky,et al.  Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm , 2014, Nucleic acids research.