Scalable address spaces using RCU balanced trees

Software developers commonly exploit multicore processors by building multithreaded software in which all threads of an application share a single address space. This shared address space has a cost: kernel virtual memory operations, such as handling soft page faults, growing the address space, and mapping files, can limit the scalability of these applications. In widely used operating systems, all of these operations are synchronized by a single per-process lock. This paper contributes a new design for increasing the concurrency of kernel operations on a shared address space by exploiting read-copy-update (RCU), so that soft page faults can both run in parallel with operations that mutate the same address space and avoid contending with other page faults on shared cache lines. To enable such parallelism, this paper also introduces an RCU-based binary balanced tree for storing memory mappings. An experimental evaluation using three multithreaded applications shows performance improvements on 80 cores ranging from 1.7x to 3.4x for an implementation of this design in the Linux 2.6.37 kernel. The RCU-based binary tree enables soft page faults to run at a constant cost with an increasing number of cores, suggesting that the design will scale well beyond 80 cores.
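The read-copy-update idea behind the design can be illustrated with a minimal Python sketch. It is not the paper's kernel implementation: the names `Node`, `AddressSpace`, `lookup`, and `map_region` are hypothetical, the tree is left unbalanced for brevity, mapped regions are assumed not to overlap, and Python's garbage collector stands in for the deferred reclamation that kernel RCU performs after a grace period. Readers (the analogue of soft page faults) take no lock and traverse an immutable snapshot; writers copy the path they change and publish a new root with a single reference assignment.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Immutable tree node describing one mapped region [start, end)."""
    start: int
    end: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def _insert(node: Optional[Node], start: int, end: int) -> Node:
    """Return a new tree containing the region; copies only the search path."""
    if node is None:
        return Node(start, end)
    if start < node.start:
        return Node(node.start, node.end, _insert(node.left, start, end), node.right)
    return Node(node.start, node.end, node.left, _insert(node.right, start, end))


class AddressSpace:
    """Hypothetical address space: lock-free readers, serialized writers."""

    def __init__(self) -> None:
        self._root: Optional[Node] = None
        self._write_lock = threading.Lock()  # writers only; readers never take it

    def lookup(self, addr: int) -> Optional[Tuple[int, int]]:
        """Reader path (page-fault analogue): one read of the root, no locks."""
        node = self._root  # snapshot; later updates do not affect this traversal
        while node is not None:
            if addr < node.start:
                node = node.left
            elif addr >= node.end:
                node = node.right
            else:
                return (node.start, node.end)
        return None

    def map_region(self, start: int, end: int) -> None:
        """Writer path: build the updated tree, then publish it in one assignment."""
        with self._write_lock:
            self._root = _insert(self._root, start, end)
```

Because nodes are never modified after publication, a reader that started before an update simply finishes on the old version of the tree, which is the property that lets lookups proceed concurrently with mutations in this sketch.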

Published in ASPLOS, 2012.