Optimized Mapping of Pipelined Task Graphs on the Cell BE

Limited bandwidth to off-chip main memory poses a problem for streaming applications on chip multiprocessors such as Cell BE, and will become more severe with the expected increase in the number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, the generation of memory-efficient code is thus an important compiler optimization. We propose using pipelining between the SPEs over the high-bandwidth internal bus of Cell BE to reduce the required main memory bandwidth, and thereby improve the computation throughput for memory-intensive computations. At the same time, we are constrained by the limited size of the SPE on-chip memory available for the additional buffers that pipelining between SPEs requires. We investigate mappings of the nodes of a pipelined parallel task graph to the SPEs that are optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multi-objective optimization problem by deriving an integer linear programming (ILP) formulation and computing Pareto-optimal solutions for the mapping with a state-of-the-art ILP solver. For larger problem instances, we sketch a two-step approach to reduce problem size. We exemplify our mapping technique with several memory-intensive example problems: acyclic pipelined task graphs derived from data-parallel code, complete d-ary tree pipelines for parallel mergesort on Cell BE, and butterfly pipelines for parallel FFT on Cell BE. We validate the mappings with discrete event simulations.
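To make the three-way trade-off concrete, the following minimal sketch enumerates all mappings of a toy task graph onto two SPEs and keeps the Pareto-optimal cost vectors. It is not the ILP formulation of the paper: it uses brute-force enumeration in place of an ILP solver, and the task graph, the work values, the cost model (one buffer at each end of every inter-SPE edge, one unit of bus traffic per such edge), and all names are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical toy instance (not from the paper): a diamond-shaped task graph.
work = {"a": 4, "b": 2, "c": 2, "d": 3}                    # relative compute load per task
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]   # producer -> consumer
NUM_SPES = 2

def evaluate(mapping):
    """Return (max SPE load, max buffers on any SPE, number of inter-SPE edges)."""
    load = [0] * NUM_SPES
    buffers = [0] * NUM_SPES
    comm = 0
    for task, spe in mapping.items():
        load[spe] += work[task]
    for src, dst in edges:
        if mapping[src] != mapping[dst]:
            # Assumed cost model: an edge crossing SPEs uses the on-chip bus
            # and needs one buffer on the sending and one on the receiving SPE.
            comm += 1
            buffers[mapping[src]] += 1
            buffers[mapping[dst]] += 1
    return (max(load), max(buffers), comm)

def dominates(p, q):
    """True if cost vector p is no worse than q everywhere and differs from q."""
    return all(x <= y for x, y in zip(p, q)) and p != q

def pareto_front():
    """Map each Pareto-optimal cost vector to one mapping that achieves it."""
    tasks = sorted(work)
    costs = {}
    for assignment in product(range(NUM_SPES), repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        costs.setdefault(evaluate(mapping), mapping)
    return {c: m for c, m in costs.items()
            if not any(dominates(other, c) for other in costs)}

if __name__ == "__main__":
    for cost, mapping in sorted(pareto_front().items()):
        print(cost, mapping)
```

On this toy instance two cost vectors survive: placing every task on one SPE avoids all buffers and bus traffic at the price of the worst load balance, while splitting the graph across both SPEs balances the load but requires buffers and inter-SPE communication. Exhaustive enumeration grows exponentially with the number of tasks, which is why a solver-based formulation is the practical route for realistic graphs.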
