The Algebraic Path Problem on the Cell/B.E. Processor

Kazuya Matsumoto, Stanislav G. Sedukhin
Distributed Parallel Processing Laboratory
The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan

Report Date: 11/30/2010
Written Language: English
Distribution Statement: First Issue: 10 copies
Manuscript submitted to Journal of Information Processing (IPSJJIP)

Keywords: algebraic path problem, all-pairs shortest paths problem, Cell Broadband Engine, performance evaluation, parallel computing

Abstract: The Algebraic Path Problem (APP) unifies well-known matrix, graph, and language problems, such as matrix inversion, all-pairs shortest paths (APSP), maximum capacity paths (MCP), minimum spanning tree, and generation of regular languages, into a single algorithmic scheme. APP instances differ in the underlying algebraic structure. This paper explores the APP and presents an implementation of a block algorithm for solving the APP on the Cell Broadband Engine (Cell/B.E.) heterogeneous multicore processor. The block APP algorithm spends most of its computing time in a block matrix-matrix multiply-add (MMA) operation in different algebras. In our APP algorithm, a fast dense MMA operation in linear (+,×)-algebra is utilized. The MMA implementation on the Cell/B.E. needs only a single fused multiply-add (FMA) instruction to obtain a single short-vector (+,×)-result in one cycle. APP instances such as the APSP and MCP problems are based on (min,+)- and (max,min)-algebras, respectively, which differ from the linear (+,×)-algebra and require three and four instructions to obtain a single short-vector result, taking three and four cycles. As a result, the maximum sustained performance of the MMA operation on the Cell/B.E. is 152 Gflop/s, whereas for APSP and MCP it is 50.7 Gflop/s and 38.1 Gflop/s, respectively.
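The role of the underlying algebra in the MMA operation can be illustrated with a minimal scalar sketch in Python. It is not the authors' Cell/B.E. implementation, which works on short vectors and blocks; the names `mma`, `PLUS_TIMES`, `MIN_PLUS`, and `MAX_MIN` are hypothetical and chosen only for this illustration. The sketch shows that the same triple loop computes the linear, shortest-path, and maximum-capacity variants once the two scalar operations are swapped.

```python
import math

# Each algebra is a pair (add, mul) of scalar operations, so that the update is
#   C[i][j] = add(C[i][j], mul(A[i][k], B[k][j])).
# These names are illustrative assumptions, not identifiers from the report.
PLUS_TIMES = (lambda x, y: x + y, lambda x, y: x * y)  # linear (+,x)-algebra
MIN_PLUS = (min, lambda x, y: x + y)                   # (min,+)-algebra, used for APSP
MAX_MIN = (max, min)                                   # (max,min)-algebra, used for MCP


def mma(C, A, B, algebra):
    """Return the multiply-add of C with A and B in the given algebra.

    All matrices are square lists of lists of the same size; C is not modified.
    """
    add, mul = algebra
    n = len(C)
    out = [row[:] for row in C]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                out[i][j] = add(out[i][j], mul(A[i][k], B[k][j]))
    return out


INF = math.inf

# Linear algebra: ordinary matrix multiply-add starting from a zero matrix.
linear = mma([[0, 0], [0, 0]], [[1, 2], [3, 4]], [[5, 6], [7, 8]], PLUS_TIMES)

# Shortest paths: edge weights, INF for "no edge", 0 on the diagonal.
W = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
two_hop_distances = mma(W, W, W, MIN_PLUS)  # best paths using at most two edges

# Maximum capacity paths: edge capacities, 0 for "no edge", INF on the diagonal.
Cap = [[INF, 5, 0],
       [0, INF, 4],
       [3, 0, INF]]
two_hop_capacities = mma(Cap, Cap, Cap, MAX_MIN)  # best bottleneck over at most two edges
```

In this sketch each inner-loop update is one `add` and one `mul` regardless of the algebra. On the Cell/B.E., according to the abstract, the (+,×) update maps to a single FMA instruction, while the (min,+) and (max,min) updates take three and four instructions. Dividing the reported 152 Gflop/s by three and by four gives about 50.7 and 38.0 Gflop/s, close to the reported APSP and MCP figures.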
