A cost-aggregating integer linear program for motif finding

In the motif finding problem, one seeks a set of mutually similar substrings within a collection of biological sequences. This is an important and widely studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial framework in which the goal is to choose one substring of a given length from each sequence such that the sum of their pairwise distances is minimized. We describe a novel integer linear program for the problem. It exploits the fact that distances between substrings take only a limited set of values, so pairs of sequence positions with the same distance can be considered in aggregate. We show how to tighten its linear programming relaxation by adding an exponentially large set of constraints, and we give an efficient separation algorithm for finding violated constraints, thereby showing that the tightened linear program can still be solved in polynomial time. We apply our approach to find optimal solutions for the motif finding problem and show that it is effective in practice at uncovering known transcription factor binding sites.
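To make the objective concrete, the following is a minimal sketch of the sum-of-pairwise-distances formulation, not the integer linear program described above. It assumes Hamming distance as the distance measure and one chosen substring per sequence; the function names (`brute_force_motif`, `distance_histogram`) are hypothetical and used only for illustration. The exhaustive search is exponential in the number of sequences and is only meant for toy inputs. The histogram helper illustrates the observation that distances fall into a small set of values (here 0 through the motif length), so position pairs can be grouped by distance.

```python
from collections import Counter
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq, length):
    """All substrings of the given length, in order of start position."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def brute_force_motif(seqs, length):
    """Return (cost, positions) minimizing the sum of pairwise distances.

    Exhaustive search over one start position per sequence; assumes
    Hamming distance. Ties are broken by the first tuple encountered.
    """
    best = None
    candidates = [range(len(s) - length + 1) for s in seqs]
    for positions in product(*candidates):
        chosen = [s[p:p + length] for s, p in zip(seqs, positions)]
        cost = sum(hamming(a, b) for a, b in combinations(chosen, 2))
        if best is None or cost < best[0]:
            best = (cost, positions)
    return best


def distance_histogram(seq_a, seq_b, length):
    """Count position pairs between two sequences, grouped by distance."""
    return Counter(
        hamming(a, b)
        for a in windows(seq_a, length)
        for b in windows(seq_b, length)
    )


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGTTT", "GGGACGTA"]
    print(brute_force_motif(seqs, 4))
    print(sorted(distance_histogram(seqs[0], seqs[1], 4).items()))
```

Although two sequences of length 8 yield 25 position pairs for a motif of length 4, the histogram has at most five distinct keys, which is the kind of redundancy an aggregating formulation can exploit.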
