Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies

Motivation: Genomic sequencing studies, including RNA sequencing and bisulfite sequencing studies, are becoming increasingly common and increasingly large. Large genomic sequencing studies open doors for accurate molecular trait heritability estimation and powerful differential analysis. Heritability estimation and differential analysis in sequencing studies require statistical methods that properly account for the count nature of sequencing data and that are computationally efficient for large datasets.

Results: Here, we develop such a method, PQLseq (Penalized Quasi-Likelihood for sequencing count data), to enable effective and efficient heritability estimation and differential analysis using the generalized linear mixed model framework. With extensive simulations and comparisons to previous methods, we show that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data. In addition, we show that PQLseq is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and greater power than standard linear mixed model methods. Finally, we apply PQLseq to perform gene expression heritability estimation and differential expression analysis in a large RNA sequencing study in the Hutterites.

Availability and implementation: PQLseq is implemented as an R package with source code freely available at www.xzlab.org/software.html and https://cran.r-project.org/web/packages/PQLseq/index.html.

Supplementary information: Supplementary data are available at Bioinformatics online.
