We show how a numerical simulation method for systems of nonlinear hyperbolic partial differential equations (PDEs) on structured grids with explicit timestepping can be implemented efficiently on the Cell processor and on clusters of Cell processors. We describe the memory layout, the communication patterns and the optimization steps used to exploit the parallel architecture of the Cell processor. A second layer of parallelism based on the Message Passing Interface (MPI) is added to obtain a hybrid parallel code that runs efficiently on Cell clusters. Performance tests are conducted on a Cell cluster, and the Cell performance is compared with that of an x86 (Xeon) processor. Relative to a single Xeon core, the Cell processor achieves speed-ups of 60x in single precision and 20x in double precision. In a chip-to-chip comparison, the Cell code is 14x faster than a 4-core Xeon (using pthreads) in single precision and 5x faster in double precision. Parallel scaling on the cluster was limited by the relatively slow interconnect of our test system, but overall our study shows how Cell clusters can be used efficiently for simulating nonlinear hyperbolic PDE systems.
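To make the class of methods concrete, the following is a minimal sketch of explicit timestepping for a nonlinear hyperbolic conservation law on a structured grid. It is an illustration under stated assumptions, not the scheme used in this work: the equation (the inviscid Burgers equation), the Lax-Friedrichs update, the periodic boundary and all function names (`flux`, `lax_friedrichs_step`, `run`) are chosen here for brevity only. The sketch is serial; it shows that each cell update reads only its nearest neighbours, which is the property that makes such methods amenable to the grid partitioning and halo exchange a hybrid parallel code relies on.

```python
import math


def flux(u):
    """Burgers flux f(u) = u^2 / 2 (assumed stand-in for a nonlinear flux)."""
    return 0.5 * u * u


def lax_friedrichs_step(u, dx, dt):
    """Advance the cell values u by one explicit step on a periodic 1D grid."""
    n = len(u)
    new_u = [0.0] * n
    for i in range(n):
        left = u[i - 1]           # index -1 wraps around: periodic boundary
        right = u[(i + 1) % n]
        # Each update needs only the two nearest neighbours of cell i.
        new_u[i] = 0.5 * (left + right) - 0.5 * dt / dx * (flux(right) - flux(left))
    return new_u


def run(u0, dx, steps, cfl=0.9):
    """Take `steps` explicit steps, choosing dt from the CFL condition."""
    u = list(u0)
    for _ in range(steps):
        max_speed = max(abs(v) for v in u)   # |f'(u)| = |u| for Burgers
        if max_speed == 0.0:
            break                            # nothing moves; solution is static
        dt = cfl * dx / max_speed
        u = lax_friedrichs_step(u, dx, dt)
    return u


if __name__ == "__main__":
    n = 64
    dx = 1.0 / n
    u0 = [1.0 + 0.5 * math.sin(2.0 * math.pi * i / n) for i in range(n)]
    u = run(u0, dx, steps=50)
    print("total before:", sum(u0) * dx)
    print("total after: ", sum(u) * dx)
```

On a periodic grid the flux differences telescope, so the discrete total of `u` is preserved up to rounding, which gives a simple sanity check for any reimplementation of the update loop.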