Prefetching and cache management using task lifetimes

Task-based dataflow programming models and runtimes are emerging as promising candidates for programming multicore and manycore architectures. These programming models dynamically analyze task dependencies at runtime and schedule independent tasks concurrently onto the processing elements. In such models, cache locality, which is critical for performance, becomes harder to preserve in the presence of fine-grain tasks and in architectures with many simple cores. This paper presents a combined hardware-software approach that improves cache locality and offers better performance in terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide replacement decisions in the caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms hardware-only prefetchers and state-of-the-art replacement policies: it improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.
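As a rough illustration of the idea, the toy model below sketches how a runtime might prefetch a task's data in bulk and tag it with an epoch, so that the cache can prefer evicting lines whose task has already finished. This is only a software sketch under strong assumptions, not the hardware design of the paper: the class `EpochCache`, its methods `bulk_prefetch`, `access`, and `end_epoch`, the fully associative organization, and the LRU fallback are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class EpochCache:
    """Toy fully associative cache (hypothetical, for illustration only).

    Each line remembers the epoch (task lifetime) that brought it in.
    On eviction, lines from finished epochs are chosen before live ones.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # block address -> epoch id, in LRU order
        self.live_epochs = set()    # epochs whose task has not finished yet
        self.misses = 0             # demand misses only

    def _evict(self):
        # Prefer the least recently used line that belongs to a dead epoch.
        for block, epoch in self.lines.items():
            if epoch not in self.live_epochs:
                del self.lines[block]
                return
        # Every line is still live: fall back to plain LRU.
        self.lines.popitem(last=False)

    def _insert(self, block, epoch):
        if block in self.lines:
            self.lines.move_to_end(block)  # refresh recency
        elif len(self.lines) >= self.capacity:
            self._evict()
        self.lines[block] = epoch

    def bulk_prefetch(self, epoch, blocks):
        """Runtime hint: fetch all blocks of a task before it runs."""
        self.live_epochs.add(epoch)
        for block in blocks:
            self._insert(block, epoch)

    def access(self, epoch, block):
        """Demand access issued by the running task."""
        if block not in self.lines:
            self.misses += 1
        self._insert(block, epoch)

    def end_epoch(self, epoch):
        """Runtime hint: the task finished, so its data may be evicted first."""
        self.live_epochs.discard(epoch)


cache = EpochCache(capacity=4)
cache.bulk_prefetch(2, [10, 11])   # long-running task, stays live
cache.bulk_prefetch(1, [20, 21])   # short task
cache.access(1, 20)
cache.access(1, 21)
cache.end_epoch(1)                 # short task done
cache.bulk_prefetch(3, [30, 31])   # displaces the dead lines 20 and 21
cache.access(2, 10)
cache.access(2, 11)
print(cache.misses, sorted(cache.lines))
```

In this scenario a plain LRU policy would have evicted blocks 10 and 11, which the still-running task needs, because they are the oldest lines; the epoch tags let the toy cache discard the finished task's data instead.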
