Hardware support for address mapping in PGAS languages: a UPC case study

The Partitioned Global Address Space (PGAS) programming model strikes a balance between the explicit, locality-aware message-passing model and the locality-agnostic but easy-to-use shared-memory model (e.g. OpenMP). However, the PGAS memory model carries an overhead that limits both performance and scalability. Compiler optimizations are often insufficient, and the manual optimizations needed to compensate considerably reduce the productivity advantage. This paper proposes hardware architectural support for PGAS that allows the processor to handle shared addresses efficiently through new instructions. A prototype compiler makes this support available to unmodified code, preserving the PGAS productivity advantage. Speedups of up to 5.5x are demonstrated on the unmodified NAS Parallel Benchmarks using the gem5 full-system simulator.
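
To make the cost of shared addresses concrete, the following is a minimal sketch in plain C of the index arithmetic that a software UPC runtime typically performs for a block-cyclic shared array. It follows the standard UPC layout rule (consecutive blocks of elements are dealt round-robin to threads) and is not taken from the paper; the names `shared_ptr_t` and `map_index` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software view of a UPC shared pointer (illustrative only). */
typedef struct {
    size_t thread; /* thread the element has affinity to */
    size_t phase;  /* position of the element inside its block */
    size_t local;  /* element index inside that thread's local portion */
} shared_ptr_t;

/*
 * Map global element index i of a block-cyclic array to its location.
 * `block` is the blocking factor and `threads` the number of threads.
 * Assumption: blocks are dealt round-robin, as in the UPC layout rule.
 */
static shared_ptr_t map_index(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);

    shared_ptr_t p;
    p.thread = (i / block) % threads;                 /* which thread owns the block */
    p.phase  = i % block;                             /* offset inside the block */
    p.local  = (i / (block * threads)) * block + p.phase; /* full rounds before it, plus offset */
    return p;
}
```

With a blocking factor of 4 and 3 threads, element 13 lies in the fourth block, which wraps around to thread 0 at phase 1 and local index 5. Every shared access that the compiler cannot prove local pays for these divisions and modulo operations in software, which is the kind of work the abstract proposes to move into dedicated instructions.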
