Locality and Parallelism Optimization for Dynamic Programming Algorithm in Bioinformatics

Dynamic programming has been one of the most efficient approaches to sequence analysis and structure prediction in biology. However, its performance is limited by the drastic increase in both the volume of biological data and the variety of computer architectures. To address this, the paper presents algorithms that improve memory efficiency and network latency tolerance for nonserial polyadic dynamic programming, where the dependences are nonuniform. By relaxing the nonuniform dependences, we propose a new cache-oblivious scheme that improves performance on architectures with memory hierarchies. Moreover, we develop and extend a tiling technique to parallelize nonserial polyadic dynamic programming, using an alternate block-cyclic mapping strategy to balance the computational and memory load. An analytical parameterized model is formulated to determine the tile volume that minimizes the total execution time, and an algorithmic transformation schedules the tiles so that communication overlaps with computation, further reducing communication overhead on parallel architectures. Numerical experiments were carried out on several high-performance computer systems. The new cache-oblivious dynamic programming algorithm achieves a 2-10x speedup, and the parallel tiling algorithm with communication-computation overlap shows promising potential for fine-grained parallel computing on massively parallel computer systems.
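To make the term concrete, the sketch below shows one common form of a nonserial polyadic recurrence, d[i][j] = min(d[i][j], d[i][k] + d[k][j]) for i < k < j, together with a simple tiled traversal of the same triangular iteration space. It is an illustration under stated assumptions, not the scheme proposed in the paper: the recurrence form, the function names npdp_baseline and npdp_tiled, and the tile parameter are all hypothetical, and the tiling shown is plain sequential blocking rather than the cache-oblivious or parallel block-cyclic algorithms described above.

```python
def npdp_baseline(init):
    """Reference triple loop for d[i][j] = min(d[i][j], d[i][k] + d[k][j]), i < k < j.

    `init` is a square list of lists; only the strict upper triangle is used.
    The recurrence form is an assumed example of nonserial polyadic DP.
    """
    n = len(init)
    d = [row[:] for row in init]
    # Rows bottom-up, columns left-to-right, so d[i][k] and d[k][j] are final when read.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            best = d[i][j]
            for k in range(i + 1, j):
                cand = d[i][k] + d[k][j]
                if cand < best:
                    best = cand
            d[i][j] = best
    return d


def npdp_tiled(init, tile):
    """Same recurrence, visiting square tiles in order of increasing tile diagonal.

    Every tile that cell (i, j) depends on lies on a smaller tile diagonal,
    or is the current tile itself, where the in-tile loop order resolves it.
    """
    n = len(init)
    d = [row[:] for row in init]
    num_tiles = (n + tile - 1) // tile
    for diag in range(num_tiles):
        for bi in range(num_tiles - diag):
            bj = bi + diag
            i_lo, i_hi = bi * tile, min(n, (bi + 1) * tile)
            j_lo, j_hi = bj * tile, min(n, (bj + 1) * tile)
            for i in range(i_hi - 1, i_lo - 1, -1):
                for j in range(max(j_lo, i + 1), j_hi):
                    best = d[i][j]
                    for k in range(i + 1, j):
                        cand = d[i][k] + d[k][j]
                        if cand < best:
                            best = cand
                    d[i][j] = best
    return d


def make_example(n):
    """Deterministic integer input with positive weights above the diagonal."""
    return [[((i * 7 + j * 13) % 23) + 1 if j > i else 0 for j in range(n)]
            for i in range(n)]
```

The dependences are nonuniform because the number of cells read for d[i][j] grows with j - i, which is what makes tile sizing and load balancing harder than for uniform-dependence loops. In this sketch, tiles on the same diagonal do not depend on one another, so they are the natural unit to distribute across processors.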
