K-group A* for multiple sequence alignment with quasi-natural gap costs

Alignment of multiple protein or DNA sequences is an important problem in bioinformatics. Previous work has shown that the A* search algorithm can find optimal alignments for up to several sequences, and that a K-group generalization of A* can find approximate alignments for much larger numbers of sequences [T. Ikeda et al. (1999)]. In this paper, we describe the first implementation of K-group A* that uses quasi-natural gap costs, the cost model used in practice by biologists. We also introduce a new method for computing gap-opening costs in profile alignment. Our results show that K-group A* can efficiently find optimal or close-to-optimal alignments for small groups of sequences and that, for large numbers of sequences, it can find higher-quality alignments than the widely used CLUSTAL family of approximate alignment tools. This demonstrates the benefits of A* in aligning large numbers of sequences, as typically compared by biologists, and suggests that K-group A* could become a practical tool for multiple sequence alignment.

[1] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, 1994, Nucleic Acids Research.

[2] Peter J. Munson, et al. A novel randomized iterative strategy for aligning multiple protein sequences, 1991, Comput. Appl. Biosci.

[3] Olivier Poch, et al. A comprehensive comparison of multiple sequence alignment programs, 1999, Nucleic Acids Res.

[4] Richard E. Korf, et al. Divide-and-Conquer Frontier Search Applied to Optimal Sequence Alignment, 2000, AAAI/IAAI.

[5] Jonathan Schaeffer, et al. Memory-efficient A* heuristics for multiple sequence alignment, 2002, AAAI/IAAI.

[6] Sandeep K. Gupta, et al. Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment, 1995, J. Comput. Biol.

[7] Eric A. Hansen, et al. Space-Efficient Memory-Based Heuristics, 2004, AAAI.

[8] Hiroshi Imai, et al. Enhanced A* Algorithms for Multiple Alignments: Optimal Alignments for Several Sequences and k-Opt Approximate Alignments for Large Cases, 1999, Theoretical Computer Science.

[9] Knut Reinert, et al. The Practical Use of the A* Algorithm for Exact Multiple Sequence Alignment, 2000, J. Comput. Biol.

[10] Jens Stoye, et al. An iterative method for faster sum-of-pairs multiple sequence alignment, 2000, Bioinform.

[11] S. Altschul. Gap costs for multiple sequence alignment, 1989, Journal of Theoretical Biology.

[12] Winfried Just, et al. Computational Complexity of Multiple Sequence Alignment with SP-Score, 2001, J. Comput. Biol.

[13] Teruhisa Miura, et al. A* with Partial Expansion for Large Branching Factor Problems, 2000, AAAI/IAAI.

[14] Rainer Fuchs, et al. CLUSTAL V: improved software for multiple sequence alignment, 1992, Comput. Appl. Biosci.

[15] Eric A. Hansen, et al. Graph Embedding with Constraints, 2009, IJCAI.

[16] Osamu Gotoh, et al. Optimal alignment between groups of sequences and its application to multiple sequence alignment, 1993, Comput. Appl. Biosci.