Normalization of RNA-seq data using factor analysis of control genes or samples

Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that commonly used normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other, more complex unwanted technical effects. We evaluate the performance of the External RNA Controls Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.
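
The control-gene idea can be illustrated with a minimal sketch, assuming log-scale expression values and a linear model in which unwanted factors affect all genes while control genes carry no biological signal. The function name `ruvg_sketch`, the choice of ordinary least squares for the adjustment step, and the simulated data are illustrative assumptions, not the published implementation.

```python
import numpy as np


def ruvg_sketch(log_counts, control_idx, k=1):
    """Hypothetical sketch of control-gene factor analysis.

    log_counts  : samples x genes array of log-scale expression values
    control_idx : column indices of genes assumed free of biological signal
    k           : number of unwanted factors to estimate (an assumption)
    Returns (W, corrected), where W is samples x k.
    """
    Z = np.asarray(log_counts, dtype=float)
    # Center each gene across samples so the SVD captures variation,
    # not differences in average expression level.
    Z_centered = Z - Z.mean(axis=0, keepdims=True)
    # Factor analysis step: SVD restricted to the control genes.
    U, s, _ = np.linalg.svd(Z_centered[:, control_idx], full_matrices=False)
    W = U[:, :k] * s[:k]
    # Regress every gene on the estimated factors and remove the fitted part.
    alpha, *_ = np.linalg.lstsq(W, Z, rcond=None)
    corrected = Z - W @ alpha
    return W, corrected


# Simulated data (invented for illustration): 8 samples, 200 genes,
# the first 40 genes act as negative controls.
rng = np.random.default_rng(0)
n_samples, n_genes, n_controls = 8, 200, 40
batch = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)      # unwanted factor
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # factor of interest
gene_mean = rng.normal(5.0, 1.0, n_genes)
batch_effect = rng.normal(0.0, 1.0, n_genes)   # batch affects every gene
bio_effect = np.zeros(n_genes)
bio_effect[n_controls:n_controls + 20] = 2.0   # 20 truly changing genes
log_counts = (
    gene_mean
    + np.outer(batch, batch_effect)
    + np.outer(condition, bio_effect)
    + rng.normal(0.0, 0.1, (n_samples, n_genes))
)

W, corrected = ruvg_sketch(log_counts, np.arange(n_controls), k=1)
```

In this simulation the batch and condition indicators are orthogonal, so removing the estimated factor should leave the fold-change of the truly changing genes largely intact. In a real experiment the two may be partially confounded, in which case including the estimated factors as covariates in the differential expression model may be preferable to subtracting them.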
