A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory

A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. However, as the number of analyzed experimental systems increases, independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of the absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to an absolute abundance by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets: one from the model eukaryote Saccharomyces cerevisiae proliferating at different growth rates, and one from differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study and in a k-fold validation study.
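
The conversion from counts to nominal and then absolute abundance can be illustrated with a minimal sketch. It is not the paper's complete model, which is not spelled out in this abstract. It assumes a simplified Poisson model in which every spike-in read count has mean proportional to the number of spike-in molecules added, so that the maximum likelihood estimate of the reads-per-molecule factor is total spike-in reads divided by total spike-in molecules. The function name `calibrate` and all parameter names are hypothetical, chosen only for illustration.

```python
def calibrate(counts, spike_counts, spike_molecules, n_cells, yield_coeff=None):
    """Estimate per-cell transcript abundance from read counts (illustrative only).

    Assumption (not from the paper): spike-in j has count ~ Poisson(nu * m_j),
    where m_j is the number of molecules added. The Poisson MLE of nu is then
    sum(counts) / sum(molecules).

    counts          : dict transcript -> read count from cellular RNA
    spike_counts    : dict spike-in -> read count
    spike_molecules : dict spike-in -> number of molecules added to the sample
    n_cells         : number of cells pooled in the sample
    yield_coeff     : optional dict transcript -> yield relative to the
                      reference spike-in (defaults to 1.0 when missing)
    """
    total_reads = sum(spike_counts.values())
    total_molecules = sum(spike_molecules[s] for s in spike_counts)
    if total_reads <= 0 or total_molecules <= 0 or n_cells <= 0:
        raise ValueError("spike-in reads, spike-in molecules and n_cells must be positive")

    # Poisson maximum likelihood estimate of reads per spike-in molecule.
    reads_per_molecule = total_reads / total_molecules

    yield_coeff = yield_coeff or {}
    abundance = {}
    for transcript, k in counts.items():
        # Nominal abundance: molecules per cell if the transcript had the
        # same yield as the reference spike-in.
        nominal = k / (reads_per_molecule * n_cells)
        # Absolute abundance: divide by the relative yield coefficient.
        abundance[transcript] = nominal / yield_coeff.get(transcript, 1.0)
    return abundance


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    result = calibrate(
        counts={"geneA": 50},
        spike_counts={"S1": 100, "S2": 300},
        spike_molecules={"S1": 1000, "S2": 3000},
        n_cells=100,
        yield_coeff={"geneA": 0.5},
    )
    print(result)
```

With a yield coefficient of 1.0 the function returns the nominal abundance unchanged, which mirrors the abstract's separation between the nominal estimate and the scale factor measured in separate experiments.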
