Template-Based Models for Genome-Wide Analysis of Next-Generation Sequencing Data at Base-Pair Resolution

ABSTRACT We consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for the variability, along the sequence, of read counts associated with nucleosomal DNA; this variability arises from enzymatic digestion and other sample preparation steps. We also develop a calibrated Bayesian method to detect local concentrations of nucleosome positions, and we introduce a set of estimands that provide rich, interpretable summaries of nucleosome positioning. Inference is carried out via a distributed Hamiltonian Monte Carlo algorithm that can scale linearly with the length of the genome being analyzed. We provide MPI-based Python implementations of the proposed methods, both stand-alone and on Amazon EC2; on EC2, they can provide inferences on an entire Saccharomyces cerevisiae genome in less than 1 hr. We evaluate the accuracy and reproducibility of the inferences using a factorially designed simulation study and experimental replicates. The template-based approach we develop here is also applicable to single-end sequencing data, by using alternative sources of fragment length information, and to ordered and sequential data more generally. It provides a flexible and scalable alternative to mixture models, hidden Markov models, and Parzen-window methods. Supplementary materials for this article are available online.
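
As a rough illustration of the template idea, the sketch below treats the expected read count at each base pair as the per-position nucleosome intensities smeared by a fixed template, with Poisson noise on top. This is a minimal sketch under stated assumptions, not the article's implementation: the names `expected_counts`, `beta`, and `template`, the five-point template shape, the Poisson observation model, and all numeric values are hypothetical choices made for illustration only.

```python
import numpy as np


def expected_counts(beta, template):
    """Smear per-position intensities `beta` by `template` (hypothetical helper).

    Assumes a symmetric, odd-length template that sums to 1, so each
    position's intensity is spread over its neighbours without changing
    the total away from the sequence edges.
    """
    return np.convolve(beta, template, mode="same")


rng = np.random.default_rng(0)

# Hypothetical toy "genome": low background plus two well-positioned nucleosomes.
n_positions = 200
beta = np.full(n_positions, 0.05)
beta[[50, 120]] = 20.0

# Hypothetical template: how reads from one nucleosome position scatter
# around it (for example, due to digestion variability).
template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
template /= template.sum()

lam = expected_counts(beta, template)   # expected counts per base pair
counts = rng.poisson(lam)               # simulated observed counts (assumed Poisson)
```

In this toy setup, estimating nucleosome positions amounts to recovering `beta` from `counts` given `template`. The sketch only simulates data in that direction and does not attempt the Bayesian inference described in the abstract.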
