The Limits of De Novo DNA Motif Discovery

A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This effort includes searching DNA sequences for “motifs” that serve as binding sites for transcription factors (TFs) or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif against a DNA “background” sequence model that represents the null hypothesis, and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are “too null,” resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed on held-out data, as in predictive learning. Applying this criterion to common motif discovery methods yields uniformly poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where “ground truth” is known, the discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced “over-fitting” in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is fundamentally due to statistical intractability resulting from the fixed size of co-regulated gene clusters; such failures therefore do not necessarily provide evidence that undiscovered motifs are biologically inactive. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
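
The over-fitting effect described in the abstract can be illustrated with a small simulation. The sketch below is not the LR or ALR algorithm and is not taken from the paper's code; it is a toy, assuming a hypothetical "contains this k-mer" classifier, i.i.d. uniform random sequences, and illustrative parameter values (`cluster_size`, `length`, `k`). No motif is planted, so any training accuracy above chance comes purely from selecting the best of 4^k candidates on a small cluster.

```python
import random
from itertools import product


def random_sequence(rng, length):
    """Draw an i.i.d. uniform DNA sequence (no motif is planted)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def accuracy(kmer, members, non_members):
    """Fraction of sequences classified correctly by the rule
    'predict cluster membership iff the sequence contains kmer'."""
    hits = sum(kmer in s for s in members)
    rejections = sum(kmer not in s for s in non_members)
    return (hits + rejections) / (len(members) + len(non_members))


def best_kmer(members, non_members, k):
    """Exhaustively pick the k-mer with the highest training accuracy."""
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return max(candidates, key=lambda m: accuracy(m, members, non_members))


def overfit_gap(seed=0, trials=3, cluster_size=20, length=200, k=6):
    """Return (mean training accuracy, mean held-out accuracy)."""
    rng = random.Random(seed)
    train_scores, test_scores = [], []
    for _ in range(trials):
        # Four independent sets drawn from the same distribution:
        # training "cluster", training background, and held-out copies.
        sets = [[random_sequence(rng, length) for _ in range(cluster_size)]
                for _ in range(4)]
        train_pos, train_neg, test_pos, test_neg = sets
        motif = best_kmer(train_pos, train_neg, k)
        train_scores.append(accuracy(motif, train_pos, train_neg))
        test_scores.append(accuracy(motif, test_pos, test_neg))
    return sum(train_scores) / trials, sum(test_scores) / trials


if __name__ == "__main__":
    train_acc, test_acc = overfit_gap()
    print(f"training accuracy: {train_acc:.3f}")
    print(f"held-out accuracy: {test_acc:.3f}")
```

Under these assumptions the selected k-mer should look discriminative on the training cluster while performing near chance on the held-out sets, which is why held-out evaluation is the more honest measure when clusters are small.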
