An Efficient Combinatorial Approach for Solving the DNA Motif Finding Problem

The detection of an over-represented subsequence in a set of (carefully chosen) DNA sequences is often the main clue that prompts investigation of a possible functional role for that subsequence. Over-represented substrings (possibly with local mutations) in a biological string are termed motifs. A typical functional unit that can be modeled by a motif is a Transcription Factor Binding Site (TFBS), a portion of the DNA sequence suited to binding a protein that participates in complex transcriptomic biochemical reactions. A simplified combinatorial problem, the planted (l-d)-motif problem (also known as the (l-d) Challenge Problem), has been proposed in the literature to capture the essential combinatorial nature of the motif finding problem. In this paper we propose a novel graph-based algorithm for solving a refinement of the (l-d) Challenge Problem. Experimental results show that instances of the (l-d) Challenge Problem considered difficult for competing state-of-the-art methods in the literature can be solved efficiently in our framework.
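To make the problem statement concrete, the sketch below assumes the commonly used formulation of the planted (l-d)-motif problem: given a set of DNA sequences, find every string of length l that occurs in each sequence with at most d mismatches (Hamming distance). It is an exhaustive search over all 4^l candidates, practical only for small l. It is not the graph-based algorithm proposed in the paper, and it does not model the refinement the paper addresses. The function names (`hamming`, `occurs_within`, `planted_motifs`) are illustrative choices, not taken from the paper.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some window of `sequence` is within Hamming distance d of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def planted_motifs(sequences, l, d, alphabet="ACGT"):
    """Exhaustively list every length-l string that occurs, with at most d
    mismatches, in every input sequence (assumed problem formulation)."""
    found = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Toy instance: "ACGT" appears exactly in the first sequence and with
    # one substitution in the other two ("ACCT", "AGGT").
    seqs = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]
    print(planted_motifs(seqs, l=4, d=1))
```

On this toy instance the result includes `ACGT`. The running time grows as 4^l times the total input length, which is why instances with larger l and d call for more refined methods such as the one the paper proposes.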
