Evaluating the Memory System Performance of Software-Initiated Inter-core LLC Prepushing

Data prefetching speculatively issues memory requests for data that the main computation will need later, and can therefore increase stress on the limited shared resources of chip multiprocessors. Used improperly, it causes harmful effects such as cache pollution and wasted bandwidth. Accurate, fine-grained measurement of the related runtime metrics is thus an important first step toward reducing harmful prefetches and increasing memory-level parallelism (MLP) on chip multiprocessors. However, such measurement is practically impossible on real machines without introducing nontrivial performance overhead, which in turn makes the results inaccurate. In this paper, we use cycle-accurate full-system simulation to study the memory system performance of software-initiated inter-core LLC prepushing, our previously proposed data prefetching technique that controls harmful prefetches on chip multiprocessors. We modified the GEMS multiprocessor simulator to support trace-based measurement and offline analysis of MLP, DRAM bank-level parallelism (BLP), and their relationship with software-initiated inter-core LLC prepushing. Results show that prepushing achieves speedups of 1.628, 1.019 and 1.032 in mst, em3d and 429.mcf, respectively. Average L2 MLP changes by +26%, +0.3% and -1% in mst, em3d and 429.mcf, respectively.
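
To make the notion of offline MLP analysis concrete, the following is a minimal sketch of how an average MLP figure could be computed from a miss trace. It assumes a hypothetical trace format in which each L2 miss is recorded as an (issue cycle, completion cycle) pair, and it uses the common definition of MLP as the average number of outstanding misses over the cycles in which at least one miss is outstanding. The function name `average_mlp` and the trace layout are illustrative assumptions, not the format produced by the modified GEMS simulator, and the paper's exact metric may differ.

```python
from typing import Iterable, Tuple


def average_mlp(misses: Iterable[Tuple[int, int]]) -> float:
    """Average number of outstanding misses over cycles with >= 1 miss in flight.

    Each miss is a hypothetical (issue_cycle, complete_cycle) pair covering the
    half-open interval [issue_cycle, complete_cycle).
    """
    events = []
    for issue, complete in misses:
        if complete <= issue:
            raise ValueError("a miss must complete after it is issued")
        events.append((issue, +1))     # miss becomes outstanding
        events.append((complete, -1))  # miss is serviced

    # Sort by cycle; at equal cycles, completions (-1) sort before issues (+1),
    # so back-to-back misses are not counted as overlapping.
    events.sort()

    outstanding = 0   # misses currently in flight
    busy_cycles = 0   # cycles with at least one miss in flight
    miss_cycles = 0   # sum over busy cycles of the number of misses in flight
    prev_cycle = 0
    for cycle, delta in events:
        if outstanding > 0:
            span = cycle - prev_cycle
            busy_cycles += span
            miss_cycles += outstanding * span
        outstanding += delta
        prev_cycle = cycle

    return miss_cycles / busy_cycles if busy_cycles else 0.0
```

For example, two 100-cycle misses issued at cycles 0 and 50 overlap for 50 of the 150 busy cycles, so `average_mlp([(0, 100), (50, 150)])` evaluates to 200/150, or about 1.33, whereas two fully serialized misses give exactly 1.0. A prefetching scheme that overlaps more misses would show up as a higher value under this definition.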
