A parallel algorithm for the extraction of structured motifs

In this work we propose a parallel algorithm for the efficient extraction of binding-site consensus from genomic sequences. This algorithm, based on an existing approach, extracts structured motifs, which consist of an ordered collection of p ≥ 1 boxes whose sizes and the spacings between them are specified by given parameters. The contents of the boxes, which represent the extracted motifs, are unknown at the start of the process and are found by the algorithm using a suffix tree as the fundamental data structure. By partitioning the structured-motif search space, we divide the most demanding part of the algorithm among a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear in the number of available processing units. This speedup is verified by both the theoretical and the experimental analyses presented in this paper.
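The partitioning idea can be illustrated with a minimal sketch. It is not the algorithm of the paper: it replaces the suffix tree with a brute-force scan, handles only p = 2 boxes with exact matching and a single fixed spacing, and uses threads as a stand-in for loosely coupled processors. All names (`occurs`, `extract_partition`, `extract_parallel`, `quorum`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA alphabet


def occurs(seq, box1, gap, box2):
    """Return True if box1, then exactly `gap` arbitrary symbols, then box2 occurs in seq."""
    k1, k2 = len(box1), len(box2)
    span = k1 + gap + k2
    for i in range(len(seq) - span + 1):
        if seq[i:i + k1] == box1 and seq[i + k1 + gap:i + span] == box2:
            return True
    return False


def extract_partition(seqs, prefix, k1, gap, k2, quorum):
    """Search only the part of the motif space whose first box starts with `prefix`."""
    found = []
    for rest in product(ALPHABET, repeat=k1 - len(prefix)):
        box1 = prefix + "".join(rest)
        for tail in product(ALPHABET, repeat=k2):
            box2 = "".join(tail)
            # A motif is kept if it occurs in at least `quorum` sequences.
            support = sum(occurs(s, box1, gap, box2) for s in seqs)
            if support >= quorum:
                found.append((box1, box2))
    return found


def extract_parallel(seqs, k1, gap, k2, quorum, workers=4):
    """Split the search space by the first symbol of box 1 and merge the partial results."""
    prefixes = list(ALPHABET)  # one independent partition per leading symbol
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda p: extract_partition(seqs, p, k1, gap, k2, quorum), prefixes
        )
        return sorted(motif for part in parts for motif in part)


if __name__ == "__main__":
    seqs = ["TTACGAATGCA", "ACGCCTGCTT", "GGACGTTTGC"]
    print(extract_parallel(seqs, k1=3, gap=2, k2=3, quorum=3))
```

The partitions share no state, so each one could run on a separate machine and only the final merge requires communication; this independence is what makes a near-linear speedup plausible. In CPython, threads will not speed up this CPU-bound loop; they are used here only to keep the sketch short.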
