Uncovering correlated variability in epigenomic datasets using the Karhunen-Loève transform

Background
Larger variation exists in epigenomes than in genomes, as a single genome shapes the identity of multiple cell types. With the advent of next-generation sequencing, one of the key problems in computational epigenomics is the poor understanding of correlations and quantitative differences between large-scale data sets.

Results
Here we bring functional principal component analysis, a finite Karhunen-Loève transform, to genomics and explicitly decompose the variation in the coverage profiles of 27 chromatin mark ChIP-seq datasets at transcription start sites for H1, one of the most widely used human embryonic stem cell lines. Using this approach we identify positive correlations between H3K4me3 and H3K36me3, as well as between H3K9ac and H3K36me3, that have so far gone undetected by the most commonly used measure, the Pearson correlation between read enrichment coverages. We uncover highly negative correlations between H2A.Z, H3K4me3, and several histone acetylation marks, but these occur only between principal components of first and second order. We also demonstrate that levels of gene expression correlate significantly with scores of components of order higher than one, indicating that transcriptional regulation by histone marks escapes simple one-to-one relationships. These correlations were higher in significance and magnitude in protein-coding genes than in non-coding RNAs.

Conclusions
In summary, we present a methodology to explore and uncover novel patterns of epigenomic variability and covariability in genomic data sets by using a functional eigenvalue decomposition of genomic data. R code is available at: http://github.com/pmb59/KLTepigenome.
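To make the decomposition described above concrete, the following is a minimal Python sketch of a discrete Karhunen-Loève transform applied to coverage-like profiles. It is not the authors' R implementation. It assumes profiles sampled on a common grid and skips the basis smoothing that a functional analysis would normally include. The function name `klt`, the synthetic marks `mark_a` and `mark_b`, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def klt(profiles, n_components=2):
    """Discrete Karhunen-Loeve transform of row-wise profiles.

    profiles: array of shape (n_regions, n_positions), one profile per row.
    Returns (variance_ratio, components, scores).
    """
    X = np.asarray(profiles, dtype=float)
    centered = X - X.mean(axis=0)                    # subtract the mean profile
    cov = centered.T @ centered / (X.shape[0] - 1)   # position-by-position covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components] # keep the largest ones
    components = eigvecs[:, order].T                 # (n_components, n_positions)
    scores = centered @ components.T                 # (n_regions, n_components)
    variance_ratio = eigvals[order] / eigvals.sum()
    return variance_ratio, components, scores


# Synthetic stand-ins for two marks around 200 hypothetical start sites.
rng = np.random.default_rng(0)
positions = np.linspace(-1.0, 1.0, 50)
peak = np.exp(-positions**2 / 0.1)

amplitude_a = rng.normal(1.0, 0.3, size=200)
amplitude_b = 0.8 * amplitude_a + rng.normal(0.0, 0.1, size=200)  # coupled to mark A

mark_a = amplitude_a[:, None] * peak + rng.normal(0.0, 0.05, size=(200, 50))
mark_b = amplitude_b[:, None] * peak + rng.normal(0.0, 0.05, size=(200, 50))

ratio_a, comps_a, scores_a = klt(mark_a)
ratio_b, comps_b, scores_b = klt(mark_b)

# Correlate component scores across marks, for every pair of component orders.
# Eigenvector signs are arbitrary, so only the magnitude of each entry is meaningful here.
cross_corr = np.array([
    [np.corrcoef(scores_a[:, i], scores_b[:, j])[0, 1] for j in range(2)]
    for i in range(2)
])
print(cross_corr)
```

Correlating scores between components of different order, as in the off-diagonal entries of `cross_corr`, is the kind of comparison a single Pearson correlation between raw coverages cannot express.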
