Combining Prefetch Control and Cache Partitioning to Improve Multicore Performance

Modern commercial multi-core processors are equipped with multiple hardware prefetchers on each core. These prefetchers can significantly improve application performance. However, contention for shared resources, such as the last-level cache (LLC), off-chip memory bandwidth, and the memory controller, can lead to prefetch interference. Several techniques have been proposed to reduce such interference and improve performance isolation across cores, including coordinated control among prefetchers and cache partitioning (CP). Each has its advantages and disadvantages. This paper proposes combining the two techniques in a coordinated way. Prefetchers and the LLC are treated as separate resources, and a multi-resource management mechanism is proposed to control prefetching and cache partitioning. The control mechanism is implemented as a Linux kernel module and can be applied to a wide variety of prefetch architectures. An implementation on an Intel Xeon E5 v4 processor shows that combining LLC partitioning and prefetch throttling provides a significant improvement in performance and fairness.
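
As a rough illustration of treating prefetchers and the LLC as two separately managed resources, the sketch below encodes a per-core decision as two register-style values: a prefetcher-disable bit field and a contiguous cache-way bitmask. It is not the paper's kernel module. The function names, the `plan` helper, and the 20-way cache size are hypothetical, and the bit layout assumes the prefetcher-control register (MSR 0x1A4) that Intel has publicly documented for some of its processors. The code only computes values and never touches hardware.

```python
# Hypothetical sketch: encode per-core prefetch and LLC-partition decisions.
# Assumed bit layout of MSR 0x1A4 (a set bit disables that prefetcher).
PREFETCHER_BITS = {
    "l2_stream": 0,
    "l2_adjacent_line": 1,
    "dcu_stream": 2,
    "dcu_ip": 3,
}


def prefetch_msr_value(disabled):
    """Return the bit field that disables the named prefetchers."""
    value = 0
    for name in disabled:
        value |= 1 << PREFETCHER_BITS[name]
    return value


def cat_mask(first_way, num_ways, total_ways):
    """Return a contiguous way bitmask of num_ways ways starting at first_way."""
    if num_ways < 1 or first_way < 0 or first_way + num_ways > total_ways:
        raise ValueError("partition does not fit in the cache")
    return ((1 << num_ways) - 1) << first_way


def plan(requests, total_ways):
    """Assign non-overlapping LLC partitions and prefetch settings per core.

    requests is a list of (core_id, num_ways, disabled_prefetchers) tuples.
    Returns {core_id: (way_mask, prefetch_msr_value)}.
    """
    settings = {}
    next_way = 0
    for core, ways, disabled in requests:
        settings[core] = (
            cat_mask(next_way, ways, total_ways),
            prefetch_msr_value(disabled),
        )
        next_way += ways
    return settings


# Example: core 0 gets 4 ways with its L2 stream prefetcher throttled off;
# core 1 gets the remaining 16 ways with all prefetchers left enabled.
example = plan([(0, 4, ["l2_stream"]), (1, 16, [])], total_ways=20)
```

In this sketch the two resources are decided together but encoded independently, which mirrors the idea of a single controller managing both; how the actual mechanism chooses partition sizes and throttling levels is not specified in the abstract.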
