Assessing Memory Access Performance of Chapel through Synthetic Benchmarks

The Partitioned Global Address Space (PGAS) programming model strikes a balance between high performance and locality awareness. As a PGAS language, Chapel relieves programmers of handling the details of data movement in a distributed memory environment by presenting a flat memory space that is logically partitioned among executing entities. Traversing such a space requires mapping its addresses to the system virtual address space, and this abstraction inevitably causes major overheads during memory accesses. In this paper, we analyze the extent of this overhead by implementing a microbenchmark that tests the different types of memory accesses that can be observed in Chapel. We show that speedups of up to 35x can be achieved as locality is exploited. These gains were obtained through hand tuning, however, and more productive means should be provided to deliver such performance improvements without excessively burdening programmers. We therefore also discuss possibilities for increasing Chapel's performance through standard library, compiler, runtime, and/or hardware support that handles different types of memory accesses more efficiently.
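To make the address-mapping step concrete, the following is a minimal sketch in Python, not Chapel's actual runtime logic. It assumes a simple block distribution of a one-dimensional array over a fixed number of partitions; the names `GlobalAddress`, `block_size`, `translate`, and `is_local` are hypothetical and chosen only for illustration. It shows the kind of per-access arithmetic and ownership check that a flat, partitioned address space implies.

```python
from typing import NamedTuple


class GlobalAddress(NamedTuple):
    """Hypothetical (partition, offset) pair for one global index."""
    locale: int  # partition that owns the element
    offset: int  # position inside that partition's local storage


def block_size(n_elems: int, n_locales: int) -> int:
    # Ceiling division: every partition holds at most this many elements.
    return -(-n_elems // n_locales)


def translate(index: int, n_elems: int, n_locales: int) -> GlobalAddress:
    # Assumed block distribution: consecutive indices stay on the same
    # partition until a block boundary is crossed.
    if not 0 <= index < n_elems:
        raise IndexError("global index out of range")
    b = block_size(n_elems, n_locales)
    return GlobalAddress(locale=index // b, offset=index % b)


def is_local(index: int, here: int, n_elems: int, n_locales: int) -> bool:
    # An access is local only when the owning partition is the caller's own.
    return translate(index, n_elems, n_locales).locale == here
```

Under these assumptions, every access through the global view pays for the division, the modulo, and the ownership check, whereas code that already knows its data is local can index its partition directly and skip the translation.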
