Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a Many-Core System-on-a-Chip

This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. This architecture presents new opportunities for optimization. Specifically, we consider three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; and (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations reduce the overhead of OpenMP language constructs by 80%. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
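
To make item (2) concrete, here is a minimal sketch of a spin lock built on an atomic fetch-and-add. It is a generic ticket lock written with portable C11 atomics, not the C64-specific algorithm, which the abstract does not describe in detail. The names `ticket_lock_t`, `ticket_lock_init`, `ticket_lock_acquire`, and `ticket_lock_release` are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>

/* Generic ticket lock (illustrative; not the paper's algorithm).
 * Each thread takes a ticket with an atomic fetch-and-add and spins
 * until the "now serving" counter reaches its ticket number. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *lock) {
    atomic_init(&lock->next_ticket, 0u);
    atomic_init(&lock->now_serving, 0u);
}

void ticket_lock_acquire(ticket_lock_t *lock) {
    /* A single atomic read-modify-write hands out a unique ticket. */
    unsigned my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1u, memory_order_relaxed);

    /* Spin with plain loads until it is this thread's turn. */
    while (atomic_load_explicit(&lock->now_serving,
                                memory_order_acquire) != my_ticket) {
        /* busy-wait */
    }
}

void ticket_lock_release(ticket_lock_t *lock) {
    /* Only the lock holder writes now_serving, so load + store suffices. */
    unsigned current = atomic_load_explicit(&lock->now_serving,
                                            memory_order_relaxed);
    atomic_store_explicit(&lock->now_serving, current + 1u,
                          memory_order_release);
}
```

A caller would wrap a critical section between `ticket_lock_acquire(&lock)` and `ticket_lock_release(&lock)`. The ticket scheme grants the lock in FIFO order and needs only one atomic read-modify-write per acquisition, which is one reason fetch-and-add style primitives are attractive for spin locks. Whether the C64 implementation uses this scheme is an assumption not confirmed by the text above.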
