Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7

Hardware data prefetcher engines have been extensively used to reduce the impact of memory latency. However, microprocessors' hardware prefetcher engines do not include any automatic hardware control able to dynamically tune their operation. This lacking architectural feature causes systems to operate with prefetchers in a fixed configuration, which in many cases harms performance and energy consumption. In this paper, a piece of software that solves the discussed problem in the context of the IBM POWER7 microprocessor is presented. The proposed solution involves using the runtime software as a bridge that is able to characterize user applications' workload and dynamically reconfigure the prefetcher engine. The proposed mechanisms has been deployed over OmpSs, a state-of-the-art task-based programming model. The paper shows significant performance improvements over a representative set of microbenchmarks and High Performance Computing (HPC) applications.

[1]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[2]  Fang Liu,et al.  Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors , 2011, SIGMETRICS '11.

[3]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[4]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[5]  Alejandro Duran,et al.  Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[6]  Dirk Grunwald,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[7]  Onur Mutlu,et al.  Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[8]  Carole-Jean Wu,et al.  Characterization and dynamic mitigation of intra-application cache interference , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[9]  Chia-Lin Yang,et al.  Push vs. pull: data movement for linked data structures , 2000, ICS '00.

[10]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[11]  A. Patera A spectral element method for fluid dynamics: Laminar flow in a channel expansion , 1984 .

[12]  Andreas Moshovos,et al.  Dependence based prefetching for linked data structures , 1998, ASPLOS VIII.

[13]  Francisco J. Cazorla,et al.  Predictable performance in SMT processors: synergy between the OS and SMTs , 2006, IEEE Transactions on Computers.

[14]  Josep Torrellas,et al.  Using a user-level memory thread for correlation prefetching , 2002, ISCA.

[15]  Francisco J. Cazorla,et al.  Making data prefetch smarter: Adaptive prefetching on POWER7 , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Francisco J. Cazorla,et al.  FlexDCP: a QoS framework for CMP architectures , 2009, OPSR.

[17]  Donald Nguyen,et al.  Machine learning-based prefetch optimization for data center applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[18]  Francisco J. Cazorla,et al.  Software-Controlled Priority Characterization of POWER5 Processor , 2008, 2008 International Symposium on Computer Architecture.

[19]  Balaram Sinharoy,et al.  POWER7: IBM's next generation server processor , 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).

[20]  Onur Mutlu,et al.  Prefetch-aware shared-resource management for multi-core systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[21]  D. Yeung Learning-Based SMT Processor Resource Distribution via Hill-Climbing , 2006, ISCA 2006.

[22]  Gary S. Tyson,et al.  A prefetch taxonomy , 2004, IEEE Transactions on Computers.

[23]  Onur Mutlu,et al.  Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.