Discriminative motif optimization based on perceptron training

MOTIVATION: Generating accurate transcription factor (TF) binding site motifs from next-generation sequencing data, especially ChIP-seq, is challenging. A typical experiment reports a large number of sequences bound by a TF, and each sequence is relatively long. Most traditional motif finders are slow in handling such large amounts of data. To overcome this limitation, tools have been developed that trade accuracy for speed by using heuristic discrete search strategies or limited optimization of identified seed motifs. However, such strategies may not fully use the information in the input sequences when generating motifs. The resulting motifs often form good seeds and can be further improved with appropriate scoring functions and rapid optimization.

RESULTS: We report a tool named discriminative motif optimizer (DiMO). DiMO takes a seed motif along with a positive and a negative database and improves the motif using a discriminative strategy. We use the area under the receiver-operating characteristic curve (AUC) as a measure of the discriminating power of motifs, and a strategy based on perceptron training that maximizes AUC rapidly in a discriminative manner. Using DiMO on a large test set of 87 TFs from human, Drosophila and yeast, we show that it is possible to significantly improve motifs identified by nine motif finders. The motifs are generated and optimized on training sets and evaluated on test sets. The AUC on test sets improves for almost 90% of the TFs, and the increase is as large as 39%.

AVAILABILITY AND IMPLEMENTATION: DiMO is available at http://stormo.wustl.edu/DiMO
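The idea of scoring a motif by AUC and nudging it with perceptron-style updates can be illustrated with a minimal sketch. This is not DiMO's implementation: the function names (`one_hot`, `best_site`, `auc`, `perceptron_epoch`), the pairwise update rule, the best-window scoring and the learning rate are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np

BASES = "ACGT"

def one_hot(site):
    """Encode a DNA string as an (L, 4) 0/1 matrix."""
    m = np.zeros((len(site), 4))
    for i, b in enumerate(site):
        m[i, BASES.index(b)] = 1.0
    return m

def best_site(pwm, seq):
    """Return (score, site) for the highest-scoring window of seq."""
    w = pwm.shape[0]
    windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
    scores = [float(np.sum(pwm * one_hot(s))) for s in windows]
    k = int(np.argmax(scores))  # ties resolve to the first window
    return scores[k], windows[k]

def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, with ties counted as half."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def perceptron_epoch(pwm, positives, negatives, rate=0.1):
    """One pass of a pairwise perceptron-style update (an assumed rule):
    whenever a negative sequence scores at least as high as a positive
    one, move the matrix toward the positive's best site and away from
    the negative's best site."""
    pwm = pwm.copy()
    for p in positives:
        for n in negatives:
            sp, site_p = best_site(pwm, p)
            sn, site_n = best_site(pwm, n)
            if sp <= sn:
                pwm += rate * (one_hot(site_p) - one_hot(site_n))
    return pwm

# Toy usage: a width-2 all-zero seed matrix, one positive, one negative.
seed = np.zeros((2, 4))
positives, negatives = ["GGAC"], ["TTTT"]

def motif_auc(pwm):
    return auc([best_site(pwm, s)[0] for s in positives],
               [best_site(pwm, s)[0] for s in negatives])

updated = perceptron_epoch(seed, positives, negatives)
print(motif_auc(seed), motif_auc(updated))
```

Each misranked pair lowers the AUC, so correcting pairs one at a time is a simple proxy for increasing it. A practical tool would also need to evaluate on held-out test sets, as the abstract describes.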
