An OpenCL framework for heterogeneous multicores with local memory

In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Each accelerator core, instead, has a small internal local memory. Our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency to overcome the limited size of the local memory. To boost performance, the runtime relies on three source-code transformation techniques, work-item coalescing, web-based variable expansion and preload-poststore buffering, performed by our OpenCL C source-to-source translator. Work-item coalescing is a procedure to serialize multiple SPMD-like tasks that execute concurrently in the presence of barriers and to sequentially run them on a single accelerator core. It requires the web-based variable expansion technique to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on boosting performance. We show the effectiveness of our OpenCL framework, evaluating its performance with a system that consists of two Cell BE processors. The experimental result shows that our approach is promising.

[1]  Milind Girkar,et al.  EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system , 2007, PLDI '07.

[2]  Katherine A. Yelick,et al.  Communication optimizations for fine-grained UPC applications , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[3]  Eduard Ayguadé,et al.  Hybrid access-specific software cache techniques for the cell BE architecture , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4]  Rob van Nieuwpoort,et al.  Using many-core hardware to correlate radio astronomy signals , 2009, ICS.

[5]  Michael Gschwind,et al.  Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture , 2006, IBM Syst. J..

[6]  Ralf S. Engelschall Portable Multithreading-The Signal Stack Trick for User-Space Thread Creation , 2000, USENIX Annual Technical Conference, General Track.

[7]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[8]  References , 1971 .

[9]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[10]  Eduard Ayguadé,et al.  A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor , 2007, LCPC.

[11]  Ling Shao,et al.  DBDB: optimizing DMATransfer for the cell be architecture , 2009, ICS '09.

[12]  Costin Iancu,et al.  HUNTing the overlap , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[13]  Frederica Darema,et al.  The SPMD Model : Past, Present and Future , 2001, PVM/MPI.

[14]  Tao Zhang,et al.  Orchestrating data transfer for the cell/B.E. processor , 2008, ICS '08.

[15]  Avi Mendelson,et al.  Programming model for a heterogeneous x86 platform , 2009, PLDI '09.

[16]  Jungwon Kim,et al.  COMIC: A coherent shared memory interface for cell BE , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[18]  Barbara M. Chapman,et al.  Supercompilers for parallel and vector computers , 1990, ACM Press frontier series.

[19]  Jungwon Kim,et al.  COMIC++: A software SVM system for heterogeneous multicore accelerator clusters , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[20]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[21]  吉野 智興,et al.  Programmer's guide , 1993 .

[22]  Mark D. Hill,et al.  Weak ordering—a new definition , 1998, ISCA '98.

[23]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[24]  Edward T. Grochowski,et al.  Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[25]  Allen,et al.  Optimizing Compilers for Modern Architectures , 2004 .

[26]  Holger Scherl Cell Broadband Engine Architecture , 2011 .

[27]  Martin Hopkins,et al.  Synergistic Processing in Cell's Multicore Architecture , 2006, IEEE Micro.

[28]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[29]  Ken Kennedy,et al.  A technique for summarizing data access and its use in parallelism enhancing transformations , 1989, PLDI '89.

[30]  Jaejin Lee,et al.  Design and implementation of software-managed caches for multicores with local memory , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[31]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[32]  David Geer Taking the graphics processor beyond graphics , 2005, Computer.