Motif discovery and motif finding from genome-mapped DNase footprint data

MOTIVATION: Footprint data is an important source of information on transcription factor recognition motifs. However, a footprinted fragment may contain no sequence similar to the known recognition sites of the protein. Inspecting the genomic sequence flanking the fragment can help to locate the missing sites.

RESULTS: Genome fragments containing footprints were supplied to a pipeline that constructed a position weight matrix (PWM) for each of several motif lengths and selected the optimal PWM. Fragments were aligned with the SeSiMCMC sampler and with a new heuristic algorithm, Bigfoot. Footprints with missing hits were found for approximately 50% of factors. Extending a footprinted fragment by only 2 bp on each side recovered most of the missing hits. We automatically constructed motifs for 41 Drosophila factors. At the same false positive rate, the new motifs recognize footprints with greater sensitivity than existing models. We also discuss possible overfitting of the constructed motifs.

AVAILABILITY: Software and the collection of regulatory motifs are freely available at http://line.imb.ac.ru/DMMPMM.
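The effect of extending a footprint by 2 bp can be illustrated with a minimal Python sketch. It is not the SeSiMCMC or Bigfoot implementation: it assumes a simple log-odds PWM with a pseudocount and a uniform background, scans the forward strand only, and uses invented toy sequences. The function names (`build_pwm`, `best_hit`, `extend`) and all data below are hypothetical.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def best_hit(pwm, seq):
    """Return (score, offset) of the best forward-strand window, or None if seq is too short."""
    width = len(pwm)
    hits = [
        (sum(col[seq[i + j]] for j, col in enumerate(pwm)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(hits) if hits else None


def extend(genome, start, end, flank=2):
    """Return genome[start:end] widened by `flank` bp on each side, clipped to the sequence."""
    return genome[max(0, start - flank):min(len(genome), end + flank)]


# Toy data (invented for illustration, not taken from the paper).
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)

genome = "CCGGTGACTCAGGCC"
fp_start, fp_end = 6, 13          # annotated footprint, shifted 2 bp off the true site

core = genome[fp_start:fp_end]    # footprint exactly as annotated
wide = extend(genome, fp_start, fp_end, flank=2)

print("annotated footprint:", core, best_hit(pwm, core))
print("extended by 2 bp:   ", wide, best_hit(pwm, wide))
```

In this toy case the annotated footprint covers only part of the site, so its single window scores poorly, while the extended fragment contains the full site. Real footprints would also need reverse-strand scanning and a score threshold calibrated to a chosen false positive rate.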
