A graph-based algorithm for RNA-seq data normalization

RNA sequencing (RNA-seq) has attracted much attention in recent years as a tool for characterizing and understanding biological systems. However, drawing insights collectively from a large number of RNA-seq experiments remains a major challenge because of the normalization problem. Normalization is difficult because of an inherent circularity: RNA-seq data must be normalized before any pattern of differential (or non-differential) expression can be ascertained, yet prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have overcome this problem by assuming that most transcripts are not differentially expressed. However, as RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that relies neither on this assumption nor on prior knowledge of the reference transcripts. The algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices to serve as references. Applied to our synthesized validation data, the algorithm recovered the reference transcripts with high precision, resulting in high-quality normalization. On a realistic data set from the ENCODE project, the algorithm gave good results and finished in a reasonable time. These preliminary results suggest that it may be possible to break the long-persisting circularity problem in RNA-seq normalization.
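The general idea of a correlation graph with a densely connected reference set can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not specify. It assumes that reference transcripts share only the per-sample technical scaling and are therefore strongly correlated with one another across samples. The function names, the 0.9 correlation threshold, the simulated data, and the greedy peeling heuristic used to find a dense vertex set are all assumptions made for illustration.

```python
import numpy as np

def correlation_graph(log_expr, threshold=0.9):
    """Boolean adjacency matrix linking transcripts (rows) whose profiles across
    samples have Pearson correlation >= threshold (threshold is an assumption)."""
    corr = np.corrcoef(log_expr)
    adj = corr >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def densest_subgraph(adj):
    """Greedy peeling heuristic (a stand-in, not the paper's method): repeatedly
    remove the lowest-degree vertex and keep the vertex set with the highest
    edges-per-vertex density seen along the way."""
    alive = np.ones(adj.shape[0], dtype=bool)
    degree = adj.sum(axis=1).astype(float)
    best_density, best_set = -1.0, alive.copy()
    for _ in range(adj.shape[0]):
        density = degree[alive].sum() / (2.0 * alive.sum())
        if density > best_density:
            best_density, best_set = density, alive.copy()
        candidates = np.where(alive)[0]
        v = candidates[np.argmin(degree[candidates])]
        alive[v] = False
        degree[adj[v] & alive] -= 1  # surviving neighbours lose one edge
        degree[v] = 0
    return np.where(best_set)[0]

# Simulated log-abundances (hypothetical data): the first 20 transcripts are
# references that vary only with the per-sample scaling; the other 80 vary freely.
rng = np.random.default_rng(0)
n_samples, n_ref, n_other = 30, 20, 80
true_log_s = rng.normal(0.0, 0.5, size=n_samples)         # technical scaling per sample
base = rng.normal(5.0, 1.0, size=(n_ref + n_other, 1))    # baseline per transcript
noise_sd = np.r_[np.full(n_ref, 0.05), np.full(n_other, 2.0)][:, None]
log_expr = base + true_log_s + rng.normal(size=(n_ref + n_other, n_samples)) * noise_sd

adj = correlation_graph(log_expr)
refs = densest_subgraph(adj)

# Scaling factors from the recovered references, centred so they average to zero on the log scale.
log_sf = log_expr[refs].mean(axis=0)
log_sf -= log_sf.mean()
size_factors = np.exp(log_sf)
normalized = log_expr - log_sf  # normalized log-abundances
```

In this toy setting the dense vertex set coincides with the simulated references because nothing else is correlated. Real data contain co-regulated, differentially expressed transcripts that also form dense clusters, so a practical method needs a further criterion to tell such clusters apart from true references.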
