External Memory Best-First Search for Multiple Sequence Alignment

Multiple sequence alignment (MSA) is a central problem in computational biology. It is well known that MSA can be formulated as a shortest path problem and solved using heuristic search, but the memory requirement of A* makes it impractical for all but the smallest problems. Partial Expansion A* (PEA*) reduces the memory requirement of A* by generating only the most promising successor nodes. However, even PEA* exhausts available memory on many problems. Another alternative is Iterative Deepening Dynamic Programming, which uses an uninformed search order but stores only the nodes along the search frontier. However, it too cannot scale to the largest problems. In this paper, we propose storing nodes on cheap and plentiful secondary storage. We present a new general-purpose algorithm, Parallel External PEA* (PE2A*), that combines PEA* with Delayed Duplicate Detection to take advantage of external memory and multiple processors to solve large MSA problems. In our experiments, PE2A* is the first algorithm capable of solving the entire Reference Set 1 of the standard BAliBASE benchmark using a biologically accurate cost function. This work suggests that external best-first search can effectively use heuristic information to surpass methods that rely on uninformed search orders.
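To make the partial-expansion idea concrete, the following is a minimal in-memory sketch of a PEA*-style search on a generic weighted graph. It is not the PE2A* implementation described above: it omits external storage, Delayed Duplicate Detection, and parallelism. The names `pea_star`, `successors`, `h`, and `cutoff` are hypothetical and chosen only for illustration, and the heuristic is assumed to be admissible. Each open entry carries a stored value F. On expansion, only children with f no greater than F plus the cutoff are kept, and the parent is re-inserted with the smallest f among the children it deferred.

```python
import heapq
from itertools import count


def pea_star(start, goal, successors, h, cutoff=0):
    """Sketch of partial-expansion best-first search (illustrative only).

    successors(node) yields (child, edge_cost) pairs; h(node) is assumed
    to be an admissible estimate of the remaining cost to goal.
    Returns the cost of a cheapest path, or None if goal is unreachable.
    """
    best_g = {start: 0}
    tie = count()  # tie-breaker so the heap never compares nodes directly
    open_heap = [(h(start), next(tie), start, 0)]  # (stored F, tie, node, g)

    while open_heap:
        stored_f, _, node, g = heapq.heappop(open_heap)
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper path to this node was found later
        if node == goal:
            return g

        deferred_f = None  # smallest f among children not kept this round
        for child, cost in successors(node):
            child_g = g + cost
            child_f = child_g + h(child)
            if child_f <= stored_f + cutoff:
                # Promising child: keep it only if it improves on known paths.
                if child_g < best_g.get(child, float("inf")):
                    best_g[child] = child_g
                    heapq.heappush(open_heap, (child_f, next(tie), child, child_g))
            elif deferred_f is None or child_f < deferred_f:
                deferred_f = child_f

        if deferred_f is not None:
            # Some children were skipped: re-insert the parent with a larger F
            # instead of storing those children now.
            heapq.heappush(open_heap, (deferred_f, next(tie), node, g))

    return None


# Toy graph (made up for this example): cheapest A -> D path is A-B-C-D.
GRAPH = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}

if __name__ == "__main__":
    print(pea_star("A", "D", lambda n: GRAPH[n], lambda n: 0))
```

The trade-off in this sketch is that a node may be expanded several times, once per distinct stored F value, in exchange for never storing children whose f exceeds the current bound.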
