Tximeta: Reference sequence checksums for provenance identification in RNA-seq

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. Missing metadata also makes it harder to locate transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets. We provide a solution in the form of an R/Bioconductor package, tximeta, that automatically performs numerous annotation and metadata gathering tasks on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows by reducing overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility. The tximeta package is available at https://bioconductor.org/packages/tximeta.
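
The checksum-based identification described above can be illustrated with a minimal sketch. This is not the tximeta implementation (tximeta is an R package) and does not reproduce how any quantification tool computes its stored digest. It only shows the general idea under stated assumptions: a SHA-256 digest is taken over the sequence lines of a FASTA file, and that digest is used as a key into a table of known reference transcriptomes. The function names, the toy FASTA, and the lookup table entries are all hypothetical.

```python
import hashlib


def sequence_checksum(fasta_text: str) -> str:
    """Return a SHA-256 hex digest over the sequence lines of a FASTA string.

    Assumption for this sketch: header lines (starting with '>') are skipped
    and bases are upper-cased, so only sequence content affects the digest.
    """
    digest = hashlib.sha256()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        digest.update(line.upper().encode("ascii"))
    return digest.hexdigest()


def identify_reference(checksum: str, known: dict):
    """Look up annotation metadata by checksum; None if the reference is unknown."""
    return known.get(checksum)


# Hypothetical toy reference and lookup table, for illustration only.
toy_fasta = ">tx1\nACGT\nACGT\n>tx2\nGGCC\n"
known_references = {
    sequence_checksum(toy_fasta): {
        "source": "ExampleDB",  # hypothetical annotation source
        "release": "1",         # hypothetical release label
        "organism": "toy",      # hypothetical organism
    }
}

# The checksum recorded alongside quantification output would be used here.
metadata = identify_reference(sequence_checksum(toy_fasta), known_references)
print(metadata)
```

Keying on a digest of the sequences, rather than on a file name or a user-supplied label, means the lookup still succeeds when the file has been renamed or the label was never recorded, which is the failure mode the abstract describes.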
