fdrMotif: identifying cis-elements by an EM algorithm coupled with false discovery rate control

MOTIVATION Most de novo motif identification methods optimize the motif model first and then separately test the statistical significance of the motif score. In the first stage, a motif abundance parameter needs to be specified or modeled. In the second stage, a Z-score or P-value is used as the test statistic. Error rates under multiple comparisons are not fully considered.

METHODOLOGY We propose a simple but novel approach, fdrMotif, that selects as many binding sites as possible while controlling a user-specified false discovery rate (FDR). Unlike existing iterative methods, fdrMotif combines model optimization [e.g. position weight matrix (PWM)] and significance testing at each step. By monitoring the proportion of binding sites selected in many sets of background sequences, fdrMotif controls the FDR in the original data. The model is then updated using an expectation (E)- and maximization (M)-like procedure. We propose a new normalization procedure in the E-step for updating the model. This process is repeated until either the model converges or the number of iterations exceeds a maximum.

RESULTS Simulation studies suggest that our normalization procedure assigns larger weights to the binding sites than do two other commonly used normalization procedures. Furthermore, fdrMotif requires only a user-specified FDR and an initial PWM. When tested on 542 high confidence experimental p53 binding loci, fdrMotif identified 569 p53 binding sites in 505 (93.2%) sequences. In comparison, MEME identified more binding sites but in fewer ChIP sequences than fdrMotif. When tested on 500 sets of simulated 'ChIP' sequences with embedded known p53 binding sites, fdrMotif, compared to MEME, has higher sensitivity with similar positive predictive value. Furthermore, fdrMotif is robust to noise: it selected nearly identical binding sites in data adulterated with 50% added background sequences and the unadulterated data. We suggest that fdrMotif represents an improvement over MEME.
AVAILABILITY C code can be found at: http://www.niehs.nih.gov/research/resources/software/fdrMotif/.
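
The abstract does not state how the FDR is estimated from the background sets, so the following is only a minimal sketch of the site-selection idea under stated assumptions, not the fdrMotif implementation. It assumes every candidate site already has a PWM score, estimates the FDR at a cutoff as the mean number of background sites scoring at or above the cutoff divided by the number of sites selected in the original data, and returns the lowest cutoff that stays within the user-specified FDR (so that as many sites as possible are selected). The function name `select_threshold` and its parameters are hypothetical.

```python
from typing import Sequence


def select_threshold(real_scores: Sequence[float],
                     background_sets: Sequence[Sequence[float]],
                     target_fdr: float) -> float:
    """Return the lowest score cutoff whose estimated FDR is <= target_fdr.

    Assumed estimator (not taken from the paper):
        FDR(t) = mean over background sets of #{background score >= t}
                 / #{real score >= t}
    Returns float('inf') when no cutoff meets the target.
    """
    if not background_sets:
        raise ValueError("at least one background set is required")

    best = float("inf")
    # Only the observed real scores need to be tried as cutoffs.
    for t in sorted(set(real_scores), reverse=True):
        n_real = sum(s >= t for s in real_scores)
        n_bg = sum(sum(s >= t for s in bg) for bg in background_sets)
        mean_bg = n_bg / len(background_sets)
        # The estimate need not be monotone in t, so every cutoff is checked
        # and the lowest qualifying one is kept.
        if mean_bg / n_real <= target_fdr:
            best = t
    return best


if __name__ == "__main__":
    real = [9.0, 8.0, 7.0, 3.0, 2.0]
    backgrounds = [[2.5, 1.0, 0.5], [3.5, 1.0, 0.2]]
    print(select_threshold(real, backgrounds, target_fdr=0.15))
```

In an iterative scheme like the one described above, a cutoff of this kind would be recomputed after each model update, since the scores of both the original and the background sequences change with the PWM.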
