A reanalysis of mouse ENCODE comparative gene expression data

Recently, the Mouse ENCODE Consortium reported that comparative gene expression data from human and mouse tend to cluster more by species than by tissue. This observation was surprising, as it contradicted much of the comparative gene regulatory data collected previously, as well as the common notion that major developmental pathways are highly conserved across a wide range of species, in particular across mammals. Here we show that the Mouse ENCODE gene expression data were collected using a flawed study design, which confounded sequencing batch (namely, the assignment of samples to sequencing flowcells and lanes) with species. When we account for the batch effect, the corrected comparative gene expression data from human and mouse tend to cluster by tissue, not by species.
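The effect of such confounding can be illustrated with a toy example. The sketch below is not the pipeline used in the reanalysis; it uses invented numbers, hypothetical sample names, and simple per-batch mean centering as a stand-in for a proper batch correction method. Each sample is a tissue signal plus a larger batch offset, and each batch contains only one species.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def center_by_batch(samples, batch_of):
    """Subtract the per-gene batch mean from every sample in that batch."""
    corrected = {}
    for batch in sorted(set(batch_of.values())):
        members = [name for name in samples if batch_of[name] == batch]
        n_genes = len(samples[members[0]])
        means = [sum(samples[m][g] for m in members) / len(members)
                 for g in range(n_genes)]
        for m in members:
            corrected[m] = [v - mu for v, mu in zip(samples[m], means)]
    return corrected

# Invented toy values (assumption): six genes, two tissues, two batches.
heart = [1, 1, 1, 0, 0, 0]
liver = [0, 0, 0, 1, 1, 1]
batch_a = [4, 0, 4, 0, 4, 0]   # offset shared by all samples in batch A
batch_b = [0, 4, 0, 4, 0, 4]   # offset shared by all samples in batch B

samples = {
    "human_heart": [t + b for t, b in zip(heart, batch_a)],
    "human_liver": [t + b for t, b in zip(liver, batch_a)],
    "mouse_heart": [t + b for t, b in zip(heart, batch_b)],
    "mouse_liver": [t + b for t, b in zip(liver, batch_b)],
}
# Fully confounded design: batch A holds only human, batch B only mouse.
batch_of = {"human_heart": "A", "human_liver": "A",
            "mouse_heart": "B", "mouse_liver": "B"}

corrected = center_by_batch(samples, batch_of)

for label, data in (("raw", samples), ("batch-centered", corrected)):
    same_species = pearson(data["human_heart"], data["human_liver"])
    same_tissue = pearson(data["human_heart"], data["mouse_heart"])
    print(f"{label}: same species r={same_species:.2f}, "
          f"same tissue r={same_tissue:.2f}")
```

In the raw toy data the two human samples correlate positively while the two heart samples correlate negatively; after centering, the pattern reverses and samples group by tissue. Because batch and species coincide here, the centering step cannot tell a technical offset from a true species difference, which is the core problem with a confounded design.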
