Making the Best of Temporal Locality: Just-in-Time Renaming and Lazy Write-Back on the Cell/B.E.

Cell Superscalar (CellSs) provides a simple and flexible programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at the function or task level. The CellSs environment consists of a source-to-source compiler that translates annotated C or Fortran code, and a runtime library tailored for the Cell/B.E. that orchestrates the concurrent execution of the application. We introduce a technique called bypassing that allows CellSs to perform core-to-core Direct Memory Access (DMA) transfers for generic applications. In this review we concisely summarize the bypassing practice and introduce two improvements: just-in-time renaming and lazy write-back. These extensions come at no additional cost and can increase performance by improving the perceived bandwidth of the Element Interconnect Bus (EIB). Experiments on five fundamental linear algebra kernels demonstrate the applicability of these techniques and quantify the benefit that can be gained. We also present performance results for a first prototype of CellSs with bypassing.
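
To make the phrase "annotated C code" concrete, the following is a minimal sketch of a task annotated in the CellSs style, applied to a blocked matrix multiply. The block size, the matrix layout, and the names block_gemm, blocked_matmul and demo_first_element are assumptions made for illustration and do not come from the paper. A plain C compiler ignores the unknown pragma (possibly with a warning), so the sketch also runs sequentially without the CellSs toolchain.

```c
#include <assert.h>

#define BS 2   /* block edge length; illustrative value only */
#define NB 2   /* blocks per matrix edge; illustrative value only */

/* CellSs-style task annotation: a and b are only read, c is read and
   written. Without the CellSs compiler this is an ordinary function. */
#pragma css task input(a, b) inout(c)
void block_gemm(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Blocked C += A * B. Successive calls with the same (i, j) update the
   same block of C, so one task's output is the next task's input. */
void blocked_matmul(float A[NB][NB][BS][BS],
                    float B[NB][NB][BS][BS],
                    float C[NB][NB][BS][BS])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_gemm(A[i][k], B[k][j], C[i][j]);
}

/* Hypothetical helper: multiplies two all-ones matrices and returns the
   first element of the result. */
float demo_first_element(void)
{
    static float A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];
    assert(NB > 0 && BS > 0);
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int r = 0; r < BS; r++)
                for (int s = 0; s < BS; s++) {
                    A[i][j][r][s] = 1.0f;
                    B[i][j][r][s] = 1.0f;
                    C[i][j][r][s] = 0.0f;
                }
    blocked_matmul(A, B, C);
    return C[0][0][0][0];
}
```

In the innermost loop of blocked_matmul, consecutive tasks reuse the same block of C. This kind of producer-consumer reuse is the temporal locality the title refers to, and presumably where core-to-core transfers can replace round trips through main memory.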
