On the random access performance of Cell Broadband Engine with graph analysis application

The Cell Broad Engine (BE) Processor has unique memory access architecture besides its powerful computing engines. Many computing-intensive applications have been ported to Cell/BE successfully. But memory-intensive applications are rarely investigated except for several micro benchmarks. Since Cell/BE has powerful software visible DMA engine, this paper studies on whether Cell/BE is suit for applica- tions with large amount of random memory accesses. Two benchmarks, GUPS and SSCA#2, are used. The latter is a rather complex one that in representative of real world graph analysis applications. We find both benchmarks have good performance on Cell/BE based IBM QS20/22. Com- pared with 2 conventional multi-processor systems with the same core/thread number, GUPS is about 40-80% fast and SSCA#2 about 17-30% fast. The dynamic load balanc- ing and software pipeline for optimizing SSCA#2 are intro- duced. Based on the experiment, the potential of Cell/BE for random access is analyzed in detail as well as its limita- tions of memory controller, atomic engine and TLB manage- ment.Our research shows although more programming effort are needed, Cell/BE has the potencial for irregular memory access applications.

[1]  U. Brandes A faster algorithm for betweenness centrality , 2001 .

[2]  David A. Bader,et al.  On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study of List Ranking , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[3]  David A. Bader,et al.  FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine , 2007, HiPC.

[4]  Guang R. Gao,et al.  Experience on optimizing irregular computation for memory hierarchy in manycore architecture , 2008, ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming.

[5]  Fabrizio Petrini,et al.  Cell Multiprocessor Communication Network: Built for Speed , 2006, IEEE Micro.

[6]  Tu Deng-Biao,et al.  Fine-Grained Parallel Betweenness Centrality Algorithm Without Lock Synchronization , 2011 .

[7]  David A. Bader,et al.  On the architectural requirements for efficient execution of graph algorithms , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[8]  David A. Bader,et al.  Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[9]  Fabrizio Petrini,et al.  Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[10]  David A. Bader,et al.  Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[11]  David A. Bader Petascale Computing: Algorithms and Applications , 2007 .

[12]  David A. Bader,et al.  Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors , 2005, HiPC.