Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells

Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA- and RNA-associated proteins, splice sites, and so on. Often, these patterns can be formalized as motifs and described with a suitable mathematical model, such as a position weight matrix or an IUPAC consensus. Two key tasks are typically carried out for motifs in the analysis of genomic sequences: identifying, in a set of DNA regions, over-represented motifs from a particular motif database, and discovering over-represented motifs de novo. Here we describe existing methodology for performing these two tasks for motifs that characterize transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow the prediction of transcription factor binding events and enable the identification of transcriptional regulators and co-regulators. This review further illustrates the usefulness of motif analysis by showing how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with gene expression data, gives insight into the physical mechanisms of transcriptional modulation.
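
To make the two motif models named above concrete, the following is a minimal Python sketch of how an IUPAC consensus and a position weight matrix might be used to scan a DNA sequence. It is an illustration under stated assumptions rather than the method of any particular tool: the function names (`iupac_hits`, `sites_to_counts`, `counts_to_pwm`, `best_pwm_hit`) are hypothetical, the background is assumed uniform (0.25 per base), a pseudocount of 1 is an arbitrary choice, and only the forward strand is scanned.

```python
import math
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_hits(consensus, sequence):
    """Return 0-based start positions of consensus matches (forward strand only)."""
    body = "".join("[" + IUPAC[c] + "]" for c in consensus.upper())
    # A lookahead keeps overlapping matches from being skipped.
    return [m.start() for m in re.finditer("(?=" + body + ")", sequence.upper())]


def sites_to_counts(sites):
    """Turn equal-length aligned binding sites into per-position base counts."""
    width = len(sites[0])
    counts = [{b: 0 for b in "ACGT"} for _ in range(width)]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights (assumed uniform background)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return pwm


def best_pwm_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window, or None if none fits."""
    seq = sequence.upper()
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(column[base] for column, base in zip(pwm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    print(iupac_hits("WGATAR", "CCAGATAAGG"))
    pwm = counts_to_pwm(sites_to_counts(["GGAA", "GGAA", "GGAT", "GGAA"]))
    print(best_pwm_hit(pwm, "TTGGAACC"))
```

In practice a score threshold, reverse-strand scanning, and a realistic background model would also be needed; they are left out here to keep the sketch short.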
