Paper information - Optimizing shared data accesses in distributed-memory X10 systems


Prior studies have established the performance impact of coherence protocols optimized for specific patterns of shared-data accesses in Non-Uniform-Memory-Architecture (NUMA) systems. First, this work incorporates a directory-based protocol into the runtime system of X10 - a Partitioned-Global-Address-Space (PGAS) programming language - to manage read-mostly, producer-consumer, stencil, and migratory variables. This protocol complements the existing X10Protocol, which keeps a unique copy of each shared variable and relies on message transfers for all remote accesses. The X10Protocol is effective at managing accumulator, write-mostly, and general read-write variables. Second, the work introduces a new shared-variable access-pattern profiler, which a new coherence-policy manager uses to decide which protocol should manage each shared variable. The profiler can run in either offline or online mode. An evaluation on a 128-core distributed-memory machine shows that coordinating the two protocols does not degrade performance on any of the applications studied and achieves speedups in the range of 15% to 40% over using the X10Protocol alone. The performance is also comparable to that of carefully hand-written versions of the applications.
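The per-variable protocol choice described in the abstract can be illustrated with a minimal Python sketch. The pattern-to-protocol mapping follows the abstract; everything else is an assumption made for illustration: the names `Pattern`, `choose_protocol`, and `classify_by_ratio`, the protocol labels, and the 0.9 threshold are hypothetical, and the toy read/write-ratio heuristic is not the paper's profiler.

```python
from enum import Enum


class Pattern(Enum):
    """Access patterns named in the abstract."""
    READ_MOSTLY = "read-mostly"
    PRODUCER_CONSUMER = "producer-consumer"
    STENCIL = "stencil"
    MIGRATORY = "migratory"
    ACCUMULATOR = "accumulator"
    WRITE_MOSTLY = "write-mostly"
    GENERAL_READ_WRITE = "general read-write"


# Patterns the abstract assigns to the directory-based protocol.
DIRECTORY_PATTERNS = frozenset({
    Pattern.READ_MOSTLY,
    Pattern.PRODUCER_CONSUMER,
    Pattern.STENCIL,
    Pattern.MIGRATORY,
})


def choose_protocol(pattern: Pattern) -> str:
    """Return a protocol label for a variable with the given access pattern.

    The labels are hypothetical strings; all patterns outside
    DIRECTORY_PATTERNS fall back to the X10Protocol, as in the abstract.
    """
    return "directory" if pattern in DIRECTORY_PATTERNS else "X10Protocol"


def classify_by_ratio(reads: int, writes: int, threshold: float = 0.9) -> Pattern:
    """Toy heuristic (an assumption, not the paper's profiler).

    Labels a variable from its read/write counts alone, so it can only
    distinguish read-mostly, write-mostly, and general read-write.
    """
    total = reads + writes
    if total == 0:
        return Pattern.GENERAL_READ_WRITE
    if reads / total >= threshold:
        return Pattern.READ_MOSTLY
    if writes / total >= threshold:
        return Pattern.WRITE_MOSTLY
    return Pattern.GENERAL_READ_WRITE


if __name__ == "__main__":
    for reads, writes in [(95, 5), (5, 95), (50, 50)]:
        pattern = classify_by_ratio(reads, writes)
        print(reads, writes, pattern.value, choose_protocol(pattern))
```

A real profiler would need richer information than read/write counts, for example which places access a variable and in what order, to recognize producer-consumer, stencil, or migratory behavior; the sketch only shows how a policy manager might map a detected pattern to one of the two protocols.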
