Optimizing shared data accesses in distributed-memory X10 systems

Prior studies have established the performance impact of coherence protocols optimized for specific patterns of shared-data accesses in Non-Uniform-Memory-Architecture (NUMA) systems. This work first incorporates a directory-based protocol into the runtime system of X10, a Partitioned-Global-Address-Space (PGAS) programming language, to manage read-mostly, producer-consumer, stencil, and migratory variables. This protocol complements the existing X10Protocol, which keeps a single copy of each shared variable and relies on message transfers for all remote accesses. The X10Protocol is effective for managing accumulator, write-mostly, and general read-write variables. The work then introduces a shared-variable access-pattern profiler, whose output a new coherence-policy manager uses to decide which protocol to apply to each shared variable. The profiler can run in either offline or online mode. An evaluation on a 128-core distributed-memory machine shows that coordinating the two protocols does not degrade performance on any of the applications studied and achieves speedups of 15% to 40% over X10Protocol. The resulting performance is also comparable to that of carefully hand-written versions of the applications.
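The pattern-to-protocol assignment described above can be illustrated with a minimal sketch. It assumes the profiler has already labeled each shared variable with one access pattern. The names `Pattern`, `choose_protocol`, and `assign_protocols` are hypothetical and are not taken from the X10 runtime, and the string "directory" merely stands in for the directory-based protocol.

```python
from enum import Enum


class Pattern(Enum):
    """Access patterns named in the abstract (labels are illustrative)."""
    READ_MOSTLY = "read-mostly"
    PRODUCER_CONSUMER = "producer-consumer"
    STENCIL = "stencil"
    MIGRATORY = "migratory"
    ACCUMULATOR = "accumulator"
    WRITE_MOSTLY = "write-mostly"
    GENERAL_READ_WRITE = "general read-write"


# Patterns the abstract assigns to the directory-based protocol.
DIRECTORY_PATTERNS = frozenset({
    Pattern.READ_MOSTLY,
    Pattern.PRODUCER_CONSUMER,
    Pattern.STENCIL,
    Pattern.MIGRATORY,
})


def choose_protocol(pattern: Pattern) -> str:
    """Return the protocol for one pattern (hypothetical helper)."""
    if pattern in DIRECTORY_PATTERNS:
        return "directory"
    # Accumulator, write-mostly, and general read-write variables
    # stay with the single-copy X10Protocol.
    return "X10Protocol"


def assign_protocols(profile: dict) -> dict:
    """Map each profiled variable name to a protocol (hypothetical helper)."""
    return {name: choose_protocol(p) for name, p in profile.items()}


if __name__ == "__main__":
    profile = {"grid": Pattern.STENCIL, "total": Pattern.ACCUMULATOR}
    print(assign_protocols(profile))
```

How the profiler actually classifies a variable, in offline or online mode, is not specified here, so the sketch takes the classification as given.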
