Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Alignment of DNA and Protein Sequences

In this paper we present a branch and bound algorithm for local gapless multiple sequence alignment (motif alignment) and its implementation. The algorithm uses both score-based bounding and a novel bounding technique based on the "consistency" of the alignment. A sequence order independent search tree is used in conjunction with a technique for avoiding redundant calculations inherent in the structure of the tree. This is the first program to exploit the fact that the motif alignment problem is easier for short motifs. Indeed, for a short fixed motif width, the running time of the algorithm is asymptotically linear in the size of the input. We tested the performance of the program on a dataset of 300 E. coli promoter sequences and a dataset of 85 lipocalin protein sequences. For a motif width of 4, the optimal alignment of the entire set of sequences can be found. For the more natural motif width of 6, the program can align 21 sequences of length 100, more than twice the number of sequences which can be aligned by the best previous exact algorithm. The algorithm can relax the constraint of requiring each sequence to be aligned, and align 105 of the 300 promoter sequences with a motif width of 6. For the lipocalin dataset, we introduce a technique for reducing the effective alphabet size with a minimal loss of useful information. With this technique, we show that the program can find meaningful motifs in a reasonable amount of time by optimizing the score over three motif positions.

[1]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[2]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[3]  Martin Tompa,et al.  An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem , 1999, ISMB.

[4]  H. Margalit,et al.  Compilation of E. coli mRNA promoter sequences. , 1993, Nucleic acids research.

[5]  G. Stormo,et al.  Identifying protein-binding sites from unaligned DNA fragments. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[6]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[7]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[8]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[9]  P Horton A branch and bound algorithm for local multiple alignment. , 1996, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[10]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[11]  T. D. Schneider,et al.  Information content of binding sites on nucleotide sequences. , 1986, Journal of molecular biology.

[12]  Hiroki Arimura,et al.  On approximation algorithms for local multiple alignment , 2000, RECOMB '00.

[13]  Bin Ma,et al.  Finding similar regions in many strings , 1999, STOC '99.

[14]  H. Toh,et al.  Glutathione-independent prostaglandin D synthase as a lead molecule for designing new functional proteins. , 1996, Protein engineering.

[15]  G. Stormo Consensus patterns in DNA. , 1990, Methods in enzymology.