HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models

Models of transcription factor (TF) binding sites provide a basis for a wide spectrum of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation of transcripts and sequence variants. While TFs may recognize different sequence patterns under different conditions, it is pragmatic to have a single generic model for each TF as a baseline for practical applications. Here we present the expanded and enhanced version of HOCOMOCO (http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco10), a collection of models of DNA patterns recognized by transcription factors. HOCOMOCO now provides position weight matrix (PWM) models for binding sites of 601 human TFs and, in addition, PWMs for 396 mouse TFs. Furthermore, we introduce the largest collection to date of dinucleotide PWM (diPWM) models, covering 86 human and 52 mouse TFs. The update is based on the analysis of massive ChIP-Seq and HT-SELEX datasets, with validation of the resulting models on in vivo data. To facilitate practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed score thresholds. Finally, we provide command-line tools for PWM and diPWM threshold estimation and for motif finding in nucleotide sequences.
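
To illustrate how a PWM and a score threshold are typically used together for motif finding, the following is a minimal Python sketch. It is not the HOCOMOCO command-line tooling: the count matrix, the pseudocount, the uniform background, the threshold value and the function names (counts_to_pwm, scan) are all hypothetical choices made for illustration, and only the forward strand is scanned.

```python
import math

ALPHABET = "ACGT"

# Hypothetical count matrix for a 4 bp motif "TGAC" (rows: positions; columns: A, C, G, T).
# These numbers are invented for illustration and do not come from HOCOMOCO.
COUNTS = [
    [0, 0, 0, 10],
    [0, 0, 10, 0],
    [10, 0, 0, 0],
    [0, 10, 0, 0],
]


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log-odds weights against a uniform background (an assumption)."""
    pwm = []
    for row in counts:
        total = sum(row) + len(ALPHABET) * pseudocount
        pwm.append([math.log((c + pseudocount) / total / background) for c in row])
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for forward-strand windows scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in ALPHABET for base in window):
            continue  # skip windows containing N or other ambiguous symbols
        score = sum(pwm[i][ALPHABET.index(base)] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = counts_to_pwm(COUNTS)
# Threshold chosen by hand for this toy matrix; a real application would use a
# precomputed threshold tied to a target P-value or false positive rate.
hits = scan("AATGACGGTGACA", pwm, threshold=4.0)
for start, site, score in hits:
    print(start, site, round(score, 2))
```

In this toy setting the threshold is picked by hand so that only exact matches pass; precomputed thresholds serve the same role but are derived from the score distribution of each model.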
