Unsupervised pattern discovery in human chromatin structure through genomic segmentation

Sequence census methods like ChIP-seq now produce an unprecedented amount of genome-anchored data. We have developed an integrative method to identify patterns from multiple experiments simultaneously while taking full advantage of high-resolution data, discovering joint patterns across different assay types. We apply this method to ENCODE chromatin data for the human chronic myeloid leukemia cell line K562, including ChIP-seq data on covalent histone modifications and transcription factor binding, and DNase-seq and FAIRE-seq readouts of open chromatin. In an unsupervised fashion, we identify patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. The method yields a model which elucidates the relationship between assay observations and functional elements in the genome. This model identifies sequences likely to affect transcription, and we verify these predictions in laboratory experiments. We have made software and an integrative genome browser track freely available (noble.gs.washington.edu/proj/segway/).
