NightWatch: Integrating Lightweight and Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

Cache pollution, by which weak-locality data unduly replaces strong-locality data, may notably degrade application performance in a shared-cache multicore machine. This paper presents NightWatch, a cache management subsystem that provides general, transparent and low-overhead pollution control to applications. NightWatch is based on the observation that data within the same memory chunk or chunks within the same allocation context often share similar locality property. NightWatch embodies this observation by online monitoring current cache locality to predict future behavior and restricting potential cache polluters proactively. We have integrated NightWatch into two popular allocators, tcmalloc and ptmalloc2. Experiments with SPEC CPU2006 show that NightWatch improves application performance by up to 45% (18% on average), with an average monitoring overhead of 0.57% (up to 3.02%).

[1]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[2]  Zhao Zhang,et al.  Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[3]  David Eklov,et al.  Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Michael Stumm,et al.  RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations , 2009, ASPLOS.

[5]  R. Govindarajan,et al.  NUcache: An efficient multicore cache organization based on Next-Use distance , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[6]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[7]  Ragunathan Rajkumar,et al.  A Coordinated Approach for Practical OS-Level Cache Management in Multi-core Real-Time Systems , 2013, 2013 25th Euromicro Conference on Real-Time Systems.

[8]  Kathryn S. McKinley,et al.  Hoard: a scalable memory allocator for multithreaded applications , 2000, SIGP.

[9]  Christoforos E. Kozyrakis,et al.  Vantage: Scalable and efficient fine-grain cache partitioning , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[10]  Carole-Jean Wu,et al.  PACMan: Prefetch-Aware Cache Management for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Wolfgang Mauerer,et al.  Professional Linux Kernel Architecture , 2008 .

[12]  Xiao Zhang,et al.  Towards practical page coloring-based multicore cache management , 2009, EuroSys '09.

[13]  Ying Ye,et al.  COLORIS: A dynamic cache partitioning system using page coloring , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[14]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[15]  Michael Stumm,et al.  Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[16]  P. Sadayappan,et al.  PARDA: A Fast Parallel Reuse Distance Analysis Algorithm , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[17]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[18]  David Xinliang Li,et al.  Automated locality optimization based on the reuse distance of string operations , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[19]  Swann Perarnau,et al.  Controlling cache utilization of HPC applications , 2011, ICS '11.

[20]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[21]  Milind Kulkarni,et al.  Accelerating multicore reuse distance analysis with sampling and parallelization , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  Xiaoning Ding,et al.  ULCC: a user-level facility for optimizing shared cache performance on multicores , 2011, PPoPP '11.

[23]  Aamer Jaleel,et al.  CRUISE: cache replacement and utility-aware scheduling , 2012, ASPLOS XVII.

[24]  Erik Hagersten,et al.  Fast data-locality profiling of native execution , 2005, SIGMETRICS '05.

[25]  Mateo Valero,et al.  Improving Cache Management Policies Using Dynamic Reuse Distances , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[26]  Periklis Akritidis,et al.  Cling: A Memory Allocator to Mitigate Dangling Pointers , 2010, USENIX Security Symposium.

[27]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[28]  Haibo Chen,et al.  SSMalloc: a low-latency, locality-conscious memory allocator with stable performance scalability , 2012, APSys.