Design and implementation of tool-chain framework to support OpenMP single source compilation on cell platform

The well-known performance and power bottlenecks of traditional architecture design, together with the sustained demand for high performance in real-world applications, have stimulated new designs that integrate multiple cores in one processor. There are two approaches to multi-core design: homogeneous and heterogeneous. The homogeneous design replicates simple cores, which makes it easier for developers to port existing applications to this kind of platform. However, applications are highly diverse, and a homogeneous multi-core chip cannot be optimal for heterogeneous workloads. Therefore, more and more multi-core designs use heterogeneous cores and specialized accelerators. The Cell Broadband Engine (CBE) [1, 2] is a representative example. Every Cell processor integrates one Power Processing Engine (PPE) core and eight Synergistic Processing Engines (SPEs). The PPE core has a traditional memory and cache hierarchy and accesses memory through its caches. Each SPE core, in contrast, has only 256 KB of local storage, which it accesses directly. All data exchange between the SPEs and shared system memory goes through high-latency DMA operations. This architecture therefore presents great challenges to programmers who want to exploit parallelism: (1) Threads running on the PPE differ greatly from those on the SPEs, in both capability and ISA. Users have to program threads for each ISA and manage their cooperation and synchronization. (2) The SPE local storage is so limited that SPE code or data may have to be partitioned into overlay sections. Given the explicit memory hierarchy, programmers must issue DMA instructions at the appropriate time and transfer

[1] Lionel Lacassagne, et al. Parallelization schemes for memory optimization on the cell processor: a case study of image processing algorithm, 2007, MEDEA '07.

[2] Samuel Williams, et al. The potential of the cell processor for scientific computing, 2005, CF '06.

[3] Allen D. Malony, et al. Supporting Nested OpenMP Parallelism in the TAU Performance System, 2006, IWOMP.

[4] James A. Kahle, et al. The Cell Processor Architecture, 2005, MICRO.

[5] T. Chen, et al. Cell Broadband Engine Architecture and its first implementation—A view, 2007.

[6] Fabrizio Petrini, et al. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[7] Alexandros Stamatakis, et al. Dynamic multigrain parallelization on the cell broadband engine, 2007, PPoPP.

[8] Michael Gschwind. The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor, 2007, International Journal of Parallel Programming.

[9] Michael Gschwind. Chip multiprocessing and the cell broadband engine, 2006, CF '06.

[10] Wolfgang Karl, et al. CMP Cache Architecture and the OpenMP Performance, 2007, IWOMP.

[11] James D. Warnock, et al. Cell processor low-power design methodology, 2005, IEEE Micro.

[12] Benedict R. Gaster, et al. Exploiting Loop-Level Parallelism for SIMD Arrays Using OpenMP, 2007, IWOMP.

[13] Jay Hoeflinger. Programming with cluster openMP, 2007, PPoPP.

[14] Yi Jiang, et al. Toward an Automatic Code Layout Methodology, 2007, IWOMP.

[15] Mats Brorsson, et al. A free openmp compiler and run-time library infrastructure for research on shared memory parallel computing, 2004.

[16] David A. Bader, et al. FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine, 2007, HiPC.

[17] H. Peter Hofstee, et al. Introduction to the Cell multiprocessor, 2005, IBM J. Res. Dev.

[18] I. Wald, et al. Ray Tracing on the Cell Processor, 2006, 2006 IEEE Symposium on Interactive Ray Tracing.

[19] Alexandros Stamatakis, et al. RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[20] Venkatesan Packirisamy, et al. OpenMP in Multicore Architectures, 2005.

[21] Tao Zhang, et al. Supporting OpenMP on Cell, 2008, International Journal of Parallel Programming.

[22] Jonas Larsson, et al. Space Time Adaptive Processing Estimates for IBM/Sony/Toshiba Cell Broadband Engine Processor, 2006, 2006 International Radar Symposium.

[23] Sally A. McKee, et al. Hitting the memory wall: implications of the obvious, 1995, CARN.

[24] Fabrizio Petrini, et al. Cell Multiprocessor Communication Network: Built for Speed, 2006, IEEE Micro.

[25] Michael Gschwind, et al. Optimizing Compiler for the CELL Processor, 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[26] Rosa M. Badia, et al. CellSs: a Programming Model for the Cell BE Architecture, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[27] Kunle Olukotun, et al. A Single-Chip Multiprocessor, 1997, Computer.

[28] Fabrizio Petrini, et al. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[29] Haitao Wei, et al. Loading OpenMP to Cell: An Effective Compiler Framework for Heterogeneous Multi-core Chip, 2007, IWOMP.

[30] Guang R. Gao, et al. Performance Characteristics of OpenMP Language Constructs on a Many-core-on-a-chip Architecture, 2006, IWOMP.

[31] Sadaf R. Alam, et al. Balancing productivity and performance on the cell broadband engine, 2007, 2007 IEEE International Conference on Cluster Computing.

[32] Guang R. Gao, et al. Landing openMP on cyclops-64: an efficient mapping of openMP to a many-core system-on-a-chip, 2006, CF '06.

[33] David A. Bader, et al. On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study of List Ranking, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[34] José E. Moreira, et al. Dissecting Cyclops: a detailed analysis of a multithreaded architecture, 2003, CARN.

[35] Laurent Amsaleg, et al. Parallelization of a Hierarchical Data Clustering Algorithm Using OpenMP, 2006, IWOMP.

[36] Santosh G. Abraham, et al. Chip multithreading: opportunities and challenges, 2005, 11th International Symposium on High-Performance Computer Architecture.