A Unified Model for Joint Normalization and Differential Gene Expression Detection in RNA-Seq Data

The RNA-sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Since the RNA-seq measurements are relative in nature, between-sample normalization is an essential step in differential expression (DE) analysis. The normalization step of existing DE detection algorithms is usually ad hoc and performed only once prior to DE detection, which may be suboptimal since ideally normalization should be based on non-DE genes only and thus coupled with DE detection. We propose a unified statistical model for joint normalization and DE detection of RNA-seq data. Sample-specific normalization factors are modeled as unknown parameters in the gene-wise linear models and jointly estimated with the regression coefficients. By imposing sparsity-inducing L1 penalty (or mixed L1/L2 penalty for multiple treatment conditions) on the regression coefficients, we formulate the problem as a penalized least-squares regression problem and apply the augmented Lagrangian method to solve it. Simulation and real data studies show that the proposed model and algorithms perform better than or comparably to existing methods in terms of detection power and false-positive rate. The performance gain increases with increasingly larger sample size or higher signal to noise ratio, and is more significant when a large proportion of genes are differentially expressed in an asymmetric manner.

[1]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[2]  M. Stephens,et al.  RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. , 2008, Genome research.

[3]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[4]  Mark D. Robinson,et al.  Robustly detecting differential expression in RNA sequencing data using observation weights , 2013, Nucleic acids research.

[5]  Julia Salzman,et al.  A penalized likelihood approach for robust estimation of isoform expression. , 2013, Statistics and its interface.

[6]  Jonathan R Pollack,et al.  The retinoic acid synthesis gene ALDH1a2 is a candidate tumor suppressor in prostate cancer. , 2005, Cancer research.

[7]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[8]  Wing Hung Wong,et al.  TileMap: create chromosomal map of tiling array hybridizations , 2005, Bioinform..

[9]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[10]  Steven J. M. Jones,et al.  The Molecular Taxonomy of Primary Prostate Cancer , 2015, Cell.

[11]  W. Huber,et al.  Differential expression analysis for sequence count data , 2010 .

[12]  Li Shen,et al.  Joint between-sample normalization and differential expression detection through ℓ0-regularized regression , 2019, BMC Bioinformatics.

[13]  Hongkai Ji,et al.  Analyzing 'omics data using hierarchical models , 2010, Nature Biotechnology.

[14]  A. Oshlack,et al.  Transcript length bias in RNA-seq data confounds systems biology , 2009, Biology Direct.

[15]  Nadia Lalam,et al.  Statistical Applications in Genetics and Molecular Biology , 2007 .

[16]  C. Mason,et al.  Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data , 2013, Genome Biology.

[17]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[18]  Charity W. Law,et al.  voom: precision weights unlock linear model analysis tools for RNA-seq read counts , 2014, Genome Biology.

[19]  Matthew E. Ritchie,et al.  limma powers differential expression analyses for RNA-sequencing and microarray studies , 2015, Nucleic acids research.

[20]  Hui Jiang,et al.  Statistical Modeling of RNA-Seq Data. , 2011, Statistical science : a review journal of the Institute of Mathematical Statistics.

[21]  Terence P. Speed,et al.  A comparison of normalization methods for high density oligonucleotide array data based on variance and bias , 2003, Bioinform..

[22]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[23]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[24]  Wing Hung Wong,et al.  Statistical inferences for isoform expression in RNA-Seq , 2009, Bioinform..

[25]  Nicolas Servant,et al.  A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis , 2013, Briefings Bioinform..

[26]  Christoph Wagener,et al.  Down-regulation of CEACAM1 in human prostate cancer: correlation with loss of cell polarity, increased proliferation rate, and Gleason grade 3 to 4 transition. , 2002, Human pathology.

[27]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[28]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[29]  Tianyu Zhan,et al.  Unit-Free and Robust Detection of Differential Expression from RNA-Seq Data , 2014, Statistics in Biosciences.

[30]  M. Robinson,et al.  A scaling normalization method for differential expression analysis of RNA-seq data , 2010, Genome Biology.

[31]  Colin N. Dewey,et al.  RNA-Seq gene expression estimation with read mapping uncertainty , 2009, Bioinform..

[32]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[33]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[34]  Gordon K. Smyth,et al.  limma: Linear Models for Microarray Data , 2005 .

[35]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.