IEEE 7th BIBE Invited Plenary Keynote: Statistical Analysis of nucleosome occupancy and histone modification data

In eukaryotic cells, genomic DNA wraps around bead-like structures, called nucleosomes, so as to pack more compactly in the nucleus of the cell. The nucleosome is made up of four pairs of histone proteins (H2A, H2B, H3, and H4), which share a very similar structural motif. The positioning of nucleosomes, as well as the modification of various sites on the histone proteins (such as acetylation), plays an important but incompletely understood role in gene regulation. We propose statistical models for predicting nucleosome positioning and histone modification patterns using only genomic sequence information. Computational models have been developed to predict genome-wide nucleosome positions from DNA sequences, but these models consider only nucleosome sequences, which may have limited their power. We developed a statistical multi-resolution approach to identify a sequence signature, called the N-score, that distinguishes nucleosome-binding DNA from non-nucleosome DNA. The N-score is not sensitive to the deletion of short DNA elements and can also be estimated reasonably accurately from coarse nucleosome positioning data. We found that sequence information is highly predictive of local nucleosome enrichment or depletion, whereas the exact positions may be further fine-tuned by other regulatory factors. We observed that many characteristics of nucleosome positioning, such as nucleosome depletion in promoter regions, can be predicted accurately from sequence information through N-scores. In addition to nucleosome positioning, histone acetylation is also important in directing gene regulation. A comprehensive understanding of the regulatory role of histone acetylation is difficult because many different histone acetylation patterns exist and their effects are confounded by other factors, such as transcription factor binding sequence motifs and nucleosome occupancy.
We analyzed recent genome-wide histone acetylation data using a few complementary statistical models and tested the validity of a cumulative model in approximating the global regulatory effect of histone acetylation. Confounding effects due to transcription factor binding sequence information were estimated by using two independent motif-based algorithms followed by a variable selection method. Our analysis confirms that histone acetylation has a significant effect on transcription rates in addition to that attributable to upstream sequence motifs. Our model fits the observed genome-wide data well. Strikingly, including more complicated combinatorial effects does not improve the model's performance. Through an analysis of conditional independence, we found that H4 acetylation may not have a significant direct impact on global gene expression.
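
The comparison between a cumulative model and one with combinatorial effects can be illustrated with a small sketch. The abstract does not specify the model form, so the code below makes several assumptions: it uses ordinary least squares, treats the "cumulative" predictor as the plain sum of acetylation levels across sites, represents combinatorial effects as pairwise products, and runs on synthetic data generated to be additive. All names (`acetylation`, `expression`, `fit_r2`, and so on) are hypothetical, and the sequence-motif confounders are omitted for brevity.

```python
import numpy as np

def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

def adjusted_r2(r2, n, p):
    """Penalize R^2 for the number of predictors p (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def pairwise_products(X):
    """Return all pairwise products of columns, a simple stand-in for combinatorial effects."""
    k = X.shape[1]
    cols = [X[:, i] * X[:, j] for i in range(k) for j in range(i + 1, k)]
    return np.column_stack(cols)

# Synthetic data (assumption): 2000 genes, 5 acetylation sites, purely additive effect.
rng = np.random.default_rng(0)
n_genes, n_sites = 2000, 5
acetylation = rng.normal(size=(n_genes, n_sites))
cumulative = acetylation.sum(axis=1, keepdims=True)
expression = 0.8 * cumulative[:, 0] + rng.normal(scale=1.0, size=n_genes)

# Cumulative model: a single predictor, the summed acetylation level.
r2_cum = fit_r2(cumulative, expression)

# Richer model: per-site terms plus all pairwise interaction terms.
full = np.column_stack([acetylation, pairwise_products(acetylation)])
r2_full = fit_r2(full, expression)

adj_cum = adjusted_r2(r2_cum, n_genes, 1)
adj_full = adjusted_r2(r2_full, n_genes, full.shape[1])

print(f"cumulative model: R^2={r2_cum:.3f}, adjusted R^2={adj_cum:.3f}")
print(f"with interactions: R^2={r2_full:.3f}, adjusted R^2={adj_full:.3f}")
```

Because the synthetic response is additive by construction, the interaction terms should add almost nothing here. On real data, a similarly negligible gain in adjusted R^2 (or in held-out prediction error) would be one way to support the claim that combinatorial effects do not improve the fit.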