An asymmetric distributed shared memory model for heterogeneous parallel systems

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows light-weight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming efforts for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

[1]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[2]  Brett D. Fleisch,et al.  Mirage: a coherent distributed shared memory design , 1989, SOSP '89.

[3]  Henri E. Bal,et al.  Distributed programming with shared data , 1988, Proceedings. 1988 International Conference on Computer Languages.

[4]  Peter G. Harrison,et al.  Parallel Programming Using Skeleton Functions , 1993, PARLE.

[5]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[6]  David B. Gustavson The Scalable Coherent Interface and related standards projects , 1992, IEEE Micro.

[7]  Wen-mei W. Hwu,et al.  Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation , 2000, PLDI '00.

[8]  J. Rothnie,et al.  The KSR 1: bridging the gap between shared memory and MPPs , 1993, Digest of Papers. Compcon Spring.

[9]  John Wawrzynek,et al.  Garp: a MIPS processor with a reconfigurable coprocessor , 1997, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186).

[10]  Robert J. Harrison,et al.  Global Arrays: a portable "shared-memory" programming model for distributed memory computers , 1994, Proceedings of Supercomputing '94.

[11]  Steven S. Lumetta,et al.  CUBA: an architecture for efficient CPU/co-processor data communication , 2008, ICS '08.

[12]  Alessandro Forin,et al.  Multilanguage Parallel Programming of Heterogeneous Machines , 1988, IEEE Trans. Computers.

[13]  Mosur Ravishankar,et al.  PLUS: a distributed shared-memory system , 1990, ISCA '90.

[14]  Anoop Gupta,et al.  The performance impact of flexibility in the Stanford FLASH multiprocessor , 1994, ASPLOS VI.

[15]  Seif Haridi,et al.  Data Diffusion Machine - A Scalable Shared Virtual Memory Multiprocessor , 1988, FGCS.

[16]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1989, TOCS.

[17]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[18]  Larry D. Wittie,et al.  MERLIN. A superglue for multicomputer systems , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.

[19]  Ian Buck,et al.  GPU computing with NVIDIA CUDA , 2007, SIGGRAPH Courses.

[20]  Sanjay J. Patel,et al.  Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.

[21]  Scott Hauck,et al.  The Chimaera reconfigurable functional unit , 2004 .

[22]  Marco Vanneschi,et al.  The programming model of ASSIST, an environment for parallel and distributed portable applications , 2002, Parallel Comput..

[23]  Nicholas Carriero,et al.  Linda and Friends , 1986, Computer.

[24]  Grigori Fursin,et al.  Predictive Runtime Code Scheduling for Heterogeneous Architectures , 2008, HiPEAC.

[25]  Rosa M. Badia,et al.  CellSs: a Programming Model for the Cell BE Architecture , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[26]  Stamatis Vassiliadis,et al.  The MOLEN polymorphic processor , 2004, IEEE Transactions on Computers.

[27]  Richard P. LaRowe,et al.  Hardware assist for distributed shared memory , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[28]  Scott Pakin,et al.  Entering the petaflop era: the architecture and performance of Roadrunner , 2008, HiPC 2008.

[29]  Sanjay J. Patel,et al.  Accelerator Architectures , 2008, IEEE Micro.

[30]  A. S. Sethi,et al.  An analysis of Memnet—an experiment in high-speed shared-memory local networking , 1988, SIGCOMM 1988.

[31]  Brian N. Bershad,et al.  The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[32]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[33]  Nancy M. Amato,et al.  STAPL: An Adaptive, Generic Parallel C++ Library , 2001, LCPC.

[34]  Donald Yeung,et al.  The MIT Alewife machine: architecture and performance , 1995, ISCA '98.

[35]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[36]  Michael Stumm,et al.  Extending distributed shared memory to heterogeneous environments , 1990, Proceedings.,10th International Conference on Distributed Computing Systems.

[37]  Fadi J. Kurdahi,et al.  MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications , 2000, IEEE Trans. Computers.

[38]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[39]  Partha Dasgupta,et al.  The Clouds distributed operating system: functional description, implementation details and related work , 1988, [1988] Proceedings. The 8th International Conference on Distributed.