Assessing the consistency of public human tissue RNA-seq data sets

Sequencing-based gene expression methods such as RNA sequencing (RNA-seq) have become increasingly common, but it is often claimed that results obtained in different studies are not comparable owing to laboratory batch effects, differences in RNA extraction and sequencing library preparation methods, and differences in bioinformatics processing pipelines. It would be unfortunate if different experiments were in fact incomparable, as there is great promise in data fusion and meta-analysis applied to sequencing data sets. We therefore compared reported gene expression measurements for ostensibly similar samples (specifically, human brain, heart and kidney samples) across several RNA-seq studies to assess their overall consistency and to identify the factors contributing most to systematic differences. The same comparisons were also performed after preprocessing all data in a consistent way, eliminating potential bias from bioinformatics pipelines. We conclude that published human tissue RNA-seq expression measurements appear relatively consistent, in the sense that samples cluster by tissue rather than by laboratory of origin after simple preprocessing transformations. The article is supplemented by a detailed walkthrough with embedded R code and figures.
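
To make the notion of "clustering by tissue rather than laboratory" concrete, the following is a minimal Python sketch on invented toy numbers. It is not the authors' data or pipeline (their walkthrough is in R). The expression values, the per-gene laboratory bias factors, the sample names and the choice of log2(x + 1) as the "simple preprocessing transformation" are all assumptions made for illustration. Each sample is log-transformed, Pearson correlations are computed between all pairs, and each sample's most similar other sample is checked for a matching tissue label.

```python
import math

# Hypothetical expression values (FPKM-like) for six genes per tissue.
# These numbers are invented for illustration only.
tissue_profiles = {
    "brain":  [200, 150, 5, 3, 10, 8],
    "heart":  [4, 6, 300, 180, 12, 9],
    "kidney": [7, 5, 9, 11, 250, 160],
}

# Hypothetical per-gene multiplicative bias for each laboratory,
# standing in for a batch effect.
lab_bias = {
    "labA": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "labB": [2.0, 0.5, 1.5, 0.7, 1.8, 0.6],
}

def log_transform(values):
    """Apply log2(x + 1), one assumed choice of simple transformation."""
    return [math.log2(v + 1.0) for v in values]

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def tissue_of(sample_name):
    """Sample names have the form '<tissue>_<lab>'."""
    return sample_name.split("_")[0]

# Build one log-transformed sample per (tissue, lab) combination.
samples = {}
for tissue, profile in tissue_profiles.items():
    for lab, bias in lab_bias.items():
        raw = [value * factor for value, factor in zip(profile, bias)]
        samples[f"{tissue}_{lab}"] = log_transform(raw)

# For every sample, find the most highly correlated other sample.
nearest = {}
for name, values in samples.items():
    others = [other for other in samples if other != name]
    nearest[name] = max(others, key=lambda o: pearson(values, samples[o]))

for name, neighbour in nearest.items():
    same_tissue = tissue_of(name) == tissue_of(neighbour)
    print(f"{name} -> {neighbour} (same tissue: {same_tissue})")
```

On real data, the same nearest-neighbour check (or a hierarchical clustering of the correlation matrix) would be run on a genes-by-samples matrix assembled from the individual studies; the toy example only shows the shape of the comparison.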
