Efficient hardware support for the Partitioned Global Address Space

We present a novel communication engine architecture for non-coherent distributed shared memory systems. The shared memory is composed of a set of nodes, each exporting its local memory. Remote memory is accessed by forwarding local load or store transactions to remote nodes. No software layers are involved in a remote access on either the origin or the target side, so a user-level process can access remote locations directly. We have implemented the architecture as an FPGA-based prototype to demonstrate the functionality of the complete system. The prototype also allows real-world measurements that show the performance potential of the architecture, in particular for fine-grained memory accesses such as those typically used for synchronization tasks.
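The forwarding idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design described here: the names `SEGMENT_SIZE`, `Node`, and `GlobalAddressSpace` are hypothetical, and the address layout (node index in the upper part of the address, offset in the lower part) is one common way to partition a global address space, not necessarily the one used by the prototype.

```python
from dataclasses import dataclass, field

# Hypothetical: number of bytes each node exports into the global space.
SEGMENT_SIZE = 1 << 20


@dataclass
class Node:
    """A node that exports one memory segment (illustrative model only)."""
    node_id: int
    memory: bytearray = field(default_factory=lambda: bytearray(SEGMENT_SIZE))


class GlobalAddressSpace:
    """Software stand-in for the engine that forwards loads and stores."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes = [Node(i) for i in range(num_nodes)]

    def decode(self, gaddr: int) -> tuple[int, int]:
        # Assumed layout: gaddr = node_id * SEGMENT_SIZE + offset.
        node_id, offset = divmod(gaddr, SEGMENT_SIZE)
        if not 0 <= node_id < len(self.nodes):
            raise ValueError(f"address {gaddr:#x} is outside the global space")
        return node_id, offset

    def store(self, gaddr: int, value: int) -> None:
        # A store to a global address is forwarded to the owning node.
        node_id, offset = self.decode(gaddr)
        self.nodes[node_id].memory[offset] = value & 0xFF

    def load(self, gaddr: int) -> int:
        # A load is forwarded the same way and returns the remote byte.
        node_id, offset = self.decode(gaddr)
        return self.nodes[node_id].memory[offset]
```

In this model a caller issues `store` and `load` on a plain integer address and never names the target node, which mirrors the point that the accessing process needs no communication software of its own. Everything the sketch does in Python would be done by the engine in hardware.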
