Pipeline for the Analysis of ChIP-seq Data and New Motif Ranking Procedure

Haitham Ashoor

This thesis presents a computational methodology for ab initio identification of transcription factor binding sites (TFBSs) from ChIP-seq data. The method consists of three main steps: ChIP-seq data processing, motif discovery, and model selection. A novel method for ranking the motif models identified in this process is proposed. It combines several factors to rank the candidate motifs: the model's coverage of the ChIP-seq fragments containing the motifs from which the model was built, suitable background data made up of shuffled ChIP-seq fragments, and the p-value obtained by evaluating the model on the actual and background data. Two ChIP-seq datasets retrieved from the ENCODE project are used to evaluate the method and to demonstrate its ability to predict correct TFBSs with high precision. The first dataset relates to the neuron-restrictive silencer factor (NRSF), and the second to the growth-associated binding protein (GABP). The pipeline achieves high-precision prediction for both datasets: in both cases the top-ranked motif closely resembles the known motif for the respective transcription factor.
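The ranking idea in the abstract (coverage on real fragments, a shuffled-fragment background, and a p-value comparing the two) can be illustrated with a minimal Python sketch. The abstract does not say how these factors are computed or combined, so everything here is an assumption made for illustration: motif models are represented as plain regular-expression strings, the background is a simple mononucleotide shuffle, the p-value is a one-sided hypergeometric (Fisher-style) tail, and candidates are ordered by p-value with coverage as a tie-breaker. The function names `shuffle_fragment`, `count_hits`, `enrichment_p_value`, and `rank_motifs` are hypothetical and do not come from the thesis.

```python
import math
import random
import re


def shuffle_fragment(seq, rng):
    """Return a mononucleotide shuffle of seq (assumed background model)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def count_hits(pattern, fragments):
    """Count fragments with at least one forward-strand match of pattern."""
    regex = re.compile(pattern)
    return sum(1 for seq in fragments if regex.search(seq))


def enrichment_p_value(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided hypergeometric tail: the probability of seeing at least
    hits_fg matching fragments in the foreground if matches were spread
    at random over foreground and background (an assumed test)."""
    total = n_fg + n_bg
    total_hits = hits_fg + hits_bg
    upper = min(total_hits, n_fg)
    tail = sum(
        math.comb(total_hits, k) * math.comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / math.comb(total, n_fg)


def rank_motifs(patterns, fragments, seed=0):
    """Rank candidate patterns by p-value, breaking ties by higher coverage.
    Assumes fragments is a non-empty list of DNA strings."""
    rng = random.Random(seed)
    background = [shuffle_fragment(seq, rng) for seq in fragments]
    rows = []
    for pattern in patterns:
        hits_fg = count_hits(pattern, fragments)
        hits_bg = count_hits(pattern, background)
        rows.append({
            "pattern": pattern,
            "coverage": hits_fg / len(fragments),
            "background_coverage": hits_bg / len(background),
            "p_value": enrichment_p_value(
                hits_fg, len(fragments), hits_bg, len(background)
            ),
        })
    rows.sort(key=lambda row: (row["p_value"], -row["coverage"]))
    return rows
```

Calling `rank_motifs(["CCGGAAGT", "TTTTTTTT"], fragments)` on a list of fragment strings returns one dictionary per candidate, best first. A real pipeline would more likely score position weight matrices on both strands and use a shuffle that preserves k-let counts; the sketch only shows how the three factors named in the abstract might fit together.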
