Statistical inference for time course RNA-Seq data using a negative binomial mixed-effect model

Background: Accurate identification of differentially expressed (DE) genes in time course RNA-Seq data is crucial for understanding the dynamics of transcriptional regulatory networks. However, most available methods treat gene expression measurements at different time points as replicates and test the significance of the mean expression difference between treatments or conditions irrespective of time. They therefore fail to identify many DE genes whose profiles differ across time. In this article, we propose a negative binomial mixed-effect model (NBMM) to identify DE genes in time course RNA-Seq data. In the NBMM, mean gene expression is characterized by a fixed effect, and time dependency is described by random effects. The NBMM is flexible and can be fitted to both unreplicated and replicated time course RNA-Seq data via a penalized likelihood method. By comparing gene expression profiles over time, we further classify the DE genes into two subtypes to enhance the understanding of expression dynamics. A significance test for detecting DE genes is derived using a Kullback-Leibler distance ratio. Additionally, a significance test for gene sets is developed using a gene set score.

Results: Simulation analysis shows that the NBMM outperforms currently available methods for detecting DE genes and gene sets. Moreover, our analysis of fruit fly developmental time course RNA-Seq data demonstrates that the NBMM identifies biologically relevant genes, which are well supported by gene ontology analysis.

Conclusions: The proposed method is powerful and efficient for detecting biologically relevant DE genes and gene sets in time course RNA-Seq data.
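
The limitation described in the Background, that pooling time points as replicates can hide genes whose profiles differ across time, can be illustrated with a toy example. The sketch below is not the NBMM and uses made-up counts for a single hypothetical gene; it only shows that two conditions can have identical mean expression while their time profiles diverge strongly.

```python
# Hypothetical read counts for one gene at six time points (illustrative values only).
control = [10, 20, 30, 40, 50, 60]   # expression rises over time
treated = [60, 50, 40, 30, 20, 10]   # expression falls over time


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Treating time points as replicates compares only the pooled means.
mean_difference = mean(treated) - mean(control)

# Comparing the profiles time point by time point reveals the divergence.
pointwise_gaps = [abs(t - c) for t, c in zip(treated, control)]
max_pointwise_gap = max(pointwise_gaps)

print("difference in pooled means:", mean_difference)
print("largest gap at a single time point:", max_pointwise_gap)
```

Here the pooled means coincide, so a test on the mean difference alone would see no effect, whereas the per-time-point gaps are large. A time-aware model such as the one proposed in the abstract is designed to capture this kind of difference.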
