OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence

To extract the best performance from emerging tiered memory systems, applications must make use of the different kinds of memory available on the system. The OpenSHMEM memory model consists of data objects that are private to each Processing Element (PE) and data objects that are remotely accessible by all PEs. The remotely accessible data objects are called Symmetric Data Objects and are allocated in a memory region called the Symmetric Heap. The Symmetric Heap is created during program execution in a memory region determined by the OpenSHMEM implementation. This paper proposes a new feature called Symmetric Memory Partitions that lets users specify the size, along with other memory traits, used to create the symmetric heap. It uses Intel KNL processors as an example use case for emerging tiered memory systems. It also describes the implementation of symmetric memory partitions in Cray SHMEM and uses ParRes OpenSHMEM microbenchmark kernels to show the benefits of selecting the memory region for the symmetric heap.
