Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle

DNA replication, transcription and repair involve the recruitment of protein complexes that change their composition as they progress along the genome in a directed or strand‐specific manner. Chromatin immunoprecipitation in conjunction with hidden Markov models (HMMs) has been instrumental in understanding these processes, because HMMs segment the genome into discrete states that can be related to DNA‐associated protein complexes. However, current HMM‐based approaches cannot assign a forward or reverse direction to states or properly integrate strand‐specific data (e.g., RNA expression) with non‐strand‐specific data (e.g., ChIP), both of which are indispensable for accurately characterizing directed processes. To overcome these limitations, we introduce bidirectional HMMs, which infer directed genomic states from occupancy profiles de novo. Application to RNA polymerase II‐associated factors in yeast and chromatin modifications in human T cells recovers the majority of transcribed loci, reveals gene‐specific variations in the yeast transcription cycle and indicates the existence of directed chromatin state patterns at transcribed, but not at repressed, regions in the human genome. In yeast, we identify 32 new transcribed loci, a regulated initiation–elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. We anticipate that bidirectional HMMs will significantly improve the analysis of genome‐associated directed processes.
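
To make concrete how an HMM segments a genomic signal into discrete states, the following minimal sketch decodes a one-dimensional occupancy profile with the Viterbi algorithm. It illustrates a standard, undirected HMM only and is not the bidirectional model introduced here. The state names, emission means, shared standard deviation, transition probabilities and the toy signal are all hypothetical values chosen for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, means, sd=1.0):
    """Return the most likely state path for a 1-D signal.

    Emissions are assumed Gaussian with a state-specific mean and a
    shared standard deviation (an assumption made for this sketch).
    """
    def log_emit(state, x):
        # Gaussian log-density up to a constant shared by all states.
        return -0.5 * ((x - means[state]) / sd) ** 2

    # score[s]: best log-probability of any path ending in state s.
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    backpointers = []
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit(s, x)
            ptr[s] = best
        backpointers.append(ptr)

    # Trace the best path back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: unoccupied vs. factor-bound positions.
states = ["background", "bound"]
means = {"background": 0.0, "bound": 3.0}
log_start = {s: math.log(0.5) for s in states}
# "Sticky" transitions favour contiguous segments over isolated calls.
log_trans = {
    r: {s: math.log(0.9 if r == s else 0.1) for s in states}
    for r in states
}

# Toy occupancy profile along consecutive genomic positions.
signal = [0.1, -0.2, 0.3, 2.8, 3.1, 2.9, 0.2, 0.0]
path = viterbi(signal, states, log_start, log_trans, means)
print(path)
```

On this toy profile the decoded path labels the three high-signal positions as "bound" and the rest as "background". Such a model has no notion of strand or direction, which is the limitation the bidirectional HMMs described above are meant to address.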
