Tximeta: Reference sequence checksums for provenance identification in RNA-seq

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. Missing or incorrect metadata also makes it harder to place transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package, tximeta, that automatically performs numerous annotation and metadata gathering tasks on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows by reducing overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
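
The checksum-based identification described above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the tximeta implementation. The choice of SHA-256, the rule of hashing only the concatenated sequences, the function names, and the registry contents are all assumptions made for illustration. The idea it shows is that a digest computed from the reference sequences alone can serve as a lookup key for annotation metadata, independent of file names or header formatting.

```python
import hashlib


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_checksum(fasta_text):
    """Hash only the sequences, so names and line wrapping do not matter.

    Assumption: SHA-256 over the upper-cased, concatenated sequences.
    """
    digest = hashlib.sha256()
    for _, seq in parse_fasta(fasta_text):
        digest.update(seq.upper().encode("ascii"))
    return digest.hexdigest()


def identify_reference(checksum, registry):
    """Return the metadata registered for a checksum, or None if unknown."""
    return registry.get(checksum)


# Hypothetical toy transcriptome and registry entry, for illustration only.
REFERENCE_FASTA = ">tx1 gene=A\nACGTACGT\nACGT\n>tx2 gene=B\nGGGCCC\n"
REGISTRY = {
    sequence_checksum(REFERENCE_FASTA): {
        "source": "ExampleDB",   # hypothetical source name
        "release": "1",          # hypothetical release label
        "organism": "Example organism",
    }
}

# The same sequences with different headers and line wrapping still match.
REWRAPPED_FASTA = ">tx1\nACGTACGTACGT\n>tx2\nGGG\nCCC\n"
# A reference with one altered base does not match any registry entry.
MODIFIED_FASTA = ">tx1\nACGTACGTACGA\n>tx2\nGGGCCC\n"

if __name__ == "__main__":
    for label, fasta in [("rewrapped", REWRAPPED_FASTA), ("modified", MODIFIED_FASTA)]:
        print(label, identify_reference(sequence_checksum(fasta), REGISTRY))
```

In this sketch, an unrecognized checksum returns `None` instead of a guessed annotation, so a mismatched reference is reported as unknown rather than silently mislabeled.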
