PA-Star: A disk-assisted parallel A-Star strategy with locality-sensitive hash for multiple sequence alignment

Abstract: Multiple Sequence Alignment (MSA) is a basic operation in Bioinformatics, used to highlight the similarities among a set of sequences. The MSA problem has been proven NP-hard, so computing optimal alignments requires large amounts of memory and computing power. The problem can be modeled as a search for the minimum-cost path in a graph, and the A-Star algorithm has been adapted to solve it both sequentially and in parallel. Designing a parallel version of A-Star for MSA poses challenges such as an irregular dependency pattern and substantial memory requirements. In this paper, we propose PA-Star, a locality-sensitive multithreaded strategy based on A-Star that computes optimal MSAs using both RAM and disk to store nodes. Experimental results obtained on three different machines show that the optimizations used in PA-Star achieve a speedup of 1.88× in serial execution, and that parallel execution attains a speedup of 5.52× with 8 cores. We also show that PA-Star outperforms a state-of-the-art MSA tool based on A-Star, executing up to 4.77× faster. Finally, we show that our disk-assisted strategy is able to retrieve the optimal alignment when other tools fail.
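To make the graph formulation concrete, the following is a minimal serial sketch in Python, not the PA-Star implementation. It treats each tuple of sequence positions as a node, each alignment column as an edge, and runs A-Star with the sum of optimal pairwise suffix alignment costs as the heuristic. The cost values (`MATCH`, `MISMATCH`, `GAP`), the sum-of-pairs objective, and all function names are assumptions chosen for illustration; the abstract does not specify the scoring scheme, and the sketch omits multithreading, locality-sensitive hashing, and disk storage.

```python
import heapq
from itertools import combinations, product

# Illustrative costs only; the paper's actual scoring scheme is not given here.
MATCH, MISMATCH, GAP = 0, 1, 1


def pair_cost(a, b):
    """Cost of placing symbols a and b in the same alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def suffix_table(s, t):
    """table[i][j] = optimal cost of aligning the suffixes s[i:] and t[j:]."""
    n, m = len(s), len(t)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            options = []
            if i < n and j < m:
                options.append(pair_cost(s[i], t[j]) + table[i + 1][j + 1])
            if i < n:
                options.append(GAP + table[i + 1][j])
            if j < m:
                options.append(GAP + table[i][j + 1])
            table[i][j] = min(options)
    return table


def astar_msa_cost(seqs):
    """Return the optimal sum-of-pairs alignment cost for the given sequences."""
    k = len(seqs)
    pairs = list(combinations(range(k), 2))
    tables = {(p, q): suffix_table(seqs[p], seqs[q]) for p, q in pairs}

    def h(node):
        # Lower bound: each pair cannot do better than its own optimal alignment.
        return sum(tables[p, q][node[p]][node[q]] for p, q in pairs)

    start = (0,) * k
    goal = tuple(len(s) for s in seqs)
    # Each move advances a non-empty subset of the sequences by one symbol.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best_g = {start: 0}
    open_heap = [(h(start), 0, start)]

    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path to this node was found
        for move in moves:
            nxt = tuple(c + d for c, d in zip(node, move))
            if any(c > len(s) for c, s in zip(nxt, seqs)):
                continue
            column = [seqs[i][node[i]] if move[i] else "-" for i in range(k)]
            step = sum(pair_cost(column[p], column[q]) for p, q in pairs)
            new_g = g + step
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None
```

For example, `astar_msa_cost(["ACGT", "AGT"])` returns 1 under these assumed costs, corresponding to a single gap opposite the `C`. Each node has up to 2^k - 1 successors for k sequences, and the `best_g` dictionary grows with the number of visited nodes, which illustrates why memory becomes the limiting factor and why storing nodes on disk is attractive.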
