Bounded incoherence: a programming model for non-cache-coherent shared memory architectures

Cache coherence in modern computer architectures simplifies programming by transparently sharing data across multiple processors. Unfortunately, it can also limit scalability due to the coherence traffic generated by competing memory accesses. Rack-scale systems provide shared memory across an entire rack, but without inter-node cache coherence. This poses memory-management and concurrency-control challenges for applications, which must explicitly manage cache lines. To fully utilize rack-scale systems for low-latency and scalable computation, applications need to retain the benefits of cached memory access despite the lack of coherence. This paper introduces Bounded Incoherence, a programming and memory consistency model that enables cached access to shared data structures in non-cache-coherent memory. It ensures that updates made to memory on one node become visible on all other nodes within at most a bounded amount of time. We evaluate this memory model on a modified PowerGraph graph-processing framework and boost its performance by 30% on eight sockets by enabling cached access to its data structures.
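To make the bounded-visibility guarantee concrete, the sketch below shows one simple way such a bound could be realized. This is a minimal illustration under assumptions, not the mechanism the paper itself uses: the names bi_read, bi_write, and BOUND_NS are hypothetical, and the refresh-on-timeout policy stands in for whatever invalidation scheme a real system would employ. Each node keeps a local, coherent snapshot of an object that lives in the non-coherent shared region and re-reads the shared copy once its snapshot is older than the bound, so any writer's update is observed within BOUND_NS.

```c
/*
 * Minimal sketch of the bounded-incoherence idea from the abstract,
 * using assumed names (bi_read, bi_write, BOUND_NS). Per-node readers
 * keep a locally cached copy of a shared object and refresh it from the
 * non-coherent shared region at least once every BOUND_NS nanoseconds,
 * so an update made on another node becomes visible within that bound.
 */
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BOUND_NS 1000000ULL   /* assumed staleness bound: 1 ms */

struct shared_obj {           /* lives in the non-coherent shared region */
    uint64_t version;
    char     payload[64];
};

struct cached_obj {           /* per-node, locally coherent copy */
    struct shared_obj snapshot;
    uint64_t          fetched_ns;
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Readers call this before each access; the snapshot they observe is
 * never more than BOUND_NS older than the shared copy. */
const struct shared_obj *
bi_read(struct cached_obj *local, const struct shared_obj *shared)
{
    uint64_t t = now_ns();
    if (t - local->fetched_ns >= BOUND_NS) {
        /* On real hardware, stale cache lines covering 'shared' would be
         * invalidated/flushed here before re-reading. */
        memcpy(&local->snapshot, shared, sizeof(*shared));
        local->fetched_ns = t;
    }
    return &local->snapshot;
}

/* A writer on any node publishes an update by writing the shared copy;
 * on real hardware it would also write back its dirty cache lines. */
void
bi_write(struct shared_obj *shared, const char *data, size_t len)
{
    size_t n = len < sizeof(shared->payload) ? len : sizeof(shared->payload);
    memcpy(shared->payload, data, n);
    shared->version++;
}
```

In this sketch the staleness bound is enforced entirely by the reader's refresh check; the trade-off is that readers pay a periodic re-read cost in exchange for otherwise uncoordinated, fully cached access.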
