DISPARE: DIScriminative PAttern REfinement for Position Weight Matrices

Background: The accurate determination of transcription factor binding affinities is an important problem in biology and key to understanding the gene regulation process. Position weight matrices are commonly used to represent the binding properties of transcription factor binding sites but suffer from low information content and a large number of false matches in the genome. We describe a novel algorithm for the refinement of position weight matrices representing transcription factor binding sites based on experimental data, including ChIP-chip analyses. We present an iterative weight matrix optimization method that is more accurate in distinguishing true transcription factor binding sites from a negative control set. The initial position weight matrix comes from JASPAR, TRANSFAC or other sources. The main new features are the discriminative nature of the method and matrix width and length optimization.

Results: The algorithm was applied to the increasing collection of known transcription factor binding sites obtained from ChIP-chip experiments. The results show that our algorithm significantly improves the sensitivity and specificity of matrix models for identifying transcription factor binding sites.

Conclusion: When the transcription factor is known, it is more appropriate to use a discriminative approach such as the one presented here to derive its transcription factor-DNA binding properties starting with a matrix, as opposed to performing de novo motif discovery. Generating more accurate position weight matrices will ultimately contribute to a better understanding of eukaryotic transcriptional regulation, and could potentially offer a better alternative to ab initio motif discovery.
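To make the discriminative idea concrete, the following minimal Python sketch scores sequences with a log-odds position weight matrix and measures how well the matrix separates a positive set from a negative control set using the area under the ROC curve. It is an illustration only, not the DISPARE algorithm: the function names (`log_odds`, `best_score`, `auc`), the pseudocount, the uniform background, and the toy sequences are all assumptions introduced here, and the iterative refinement step itself is not shown.

```python
import math

BASES = "ACGT"

def log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds weight matrix.

    The pseudocount and uniform background are illustrative assumptions.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix

def best_score(matrix, sequence):
    """Return the highest window score of the matrix along the sequence."""
    width = len(matrix)
    return max(
        sum(matrix[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

# Toy count matrix for a hypothetical 4-bp motif "TGCA" (made-up data).
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 0},
]
matrix = log_odds(counts)

positives = ["AATGCATT", "CTGCAGGA"]   # stand-ins for bound regions
negatives = ["AAAAAAAA", "CCGGCCGG"]   # stand-ins for control regions

pos_scores = [best_score(matrix, s) for s in positives]
neg_scores = [best_score(matrix, s) for s in negatives]
print("AUC:", auc(pos_scores, neg_scores))
```

A refinement procedure in this spirit would propose changes to the matrix and keep those that improve a separation measure such as this one; the objective and update rule actually used by the method are described in the full paper, not here.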
