A Study of a Software Cache Implementation of the OpenMP Memory Model for Multicore and Manycore Architectures

This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM Cell architecture in particular. We propose an instantiation of the OpenMP memory model with the following advantages: (1) It prohibits undefined values, which can cause problems of safety, security, programming, and debugging. (2) It scales with the number of threads because it relies neither on communication among threads nor on a centralized directory that maintains consistency among multiple copies of each shared variable. (3) It avoids the ambiguity of the original memory model definition given in the OpenMP Specification 3.0. We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation achieves nearly linear speedup with respect to the number of threads on several NAS Parallel Benchmarks. The results also show a clear advantage over a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.
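
The abstract does not spell out the cache protocol, so the following is only a minimal sketch of the general idea it describes: each thread keeps a private software cache and synchronizes with shared memory only at its own flush, without messaging other threads or consulting a directory. The names `SharedMemory`, `ThreadCache`, `read`, `write`, and `flush` are hypothetical, and the write-back-then-invalidate policy is an assumption made for illustration, not the protocol proposed in the paper.

```python
class SharedMemory:
    """Hypothetical global memory; every cell starts at 0, so a read
    never observes an undefined value."""

    def __init__(self, size):
        self.cells = [0] * size


class ThreadCache:
    """Hypothetical per-thread software cache (illustrative assumption,
    not the paper's protocol). It touches only shared memory, never
    another thread's cache and never a directory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> locally cached value
        self.dirty = set()  # addresses written since the last flush

    def read(self, addr):
        # On a miss, fetch from shared memory; on a hit, reuse the
        # (possibly stale) local copy until the next flush.
        if addr not in self.lines:
            self.lines[addr] = self.memory.cells[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Writes stay local until this thread flushes.
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Write back this thread's dirty values, then drop all local
        # copies so later reads refetch from shared memory.
        for addr in sorted(self.dirty):
            self.memory.cells[addr] = self.lines[addr]
        self.dirty.clear()
        self.lines.clear()
```

In this sketch a value written by one thread becomes visible to another only after the writer flushes and the reader subsequently flushes (discarding its stale copy) and rereads. No global ordering among flushes by different threads is enforced, which is the kind of relaxation the abstract contrasts with a design that totally orders flush operations.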
