DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory

Translation Lookaside Buffers (TLBs) are ubiquitous in modern architectures, where they cache virtual-to-physical mappings; because they are looked up on every memory access, they are paramount to performance scalability. The emergence of chip-multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent in software by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and to synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both the cost and the frequency of TLB shootdowns increase with the number of processors, and we project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
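To make the contrast concrete, the following is a minimal toy model, not the hardware design from the paper. It assumes a directory that records, per virtual page, which cores' TLBs currently hold the mapping, so that an invalidation can be sent only to those cores. The class `SharedTLBDirectory`, its methods, and the `broadcast_targets` baseline (which models the OS approximation as "every other core") are hypothetical names introduced purely for illustration.

```python
from collections import defaultdict


class SharedTLBDirectory:
    """Toy model (assumption): tracks which cores cache each virtual page."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # Maps a virtual page number to the set of core ids caching it.
        self.sharers = defaultdict(set)

    def record_fill(self, core, vpn):
        """Note that `core` loaded the mapping for `vpn` into its TLB."""
        self.sharers[vpn].add(core)

    def record_evict(self, core, vpn):
        """Note that `core` dropped the mapping for `vpn` from its TLB."""
        holders = self.sharers.get(vpn)
        if holders is None:
            return
        holders.discard(core)
        if not holders:
            del self.sharers[vpn]

    def shootdown(self, vpn):
        """Return the exact cores that must invalidate `vpn`, then forget them."""
        return sorted(self.sharers.pop(vpn, set()))


def broadcast_targets(num_cores, initiator):
    """Conservative software baseline (assumption): interrupt every other core."""
    return [core for core in range(num_cores) if core != initiator]


if __name__ == "__main__":
    directory = SharedTLBDirectory(num_cores=8)
    directory.record_fill(core=1, vpn=0x42)
    directory.record_fill(core=5, vpn=0x42)
    directory.record_fill(core=3, vpn=0x99)

    precise = directory.shootdown(0x42)
    conservative = broadcast_targets(num_cores=8, initiator=0)
    print("directory targets:", precise)
    print("broadcast targets:", conservative)
```

In this sketch the directory contacts two cores where the conservative baseline would interrupt seven. The model ignores everything that makes the real problem hard, such as in-flight loads and stores, directory capacity, and timing.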
