CATA: Criticality Aware Task Acceleration for Multicore Processors

Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can exploit these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power devoted to a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, with average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals that are not aware of task criticality.
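For readers unfamiliar with the metric, EDP is the product of the energy a run consumes and the time it takes, so it rewards configurations that save both. The sketch below only illustrates how a relative improvement in execution time and in EDP can be computed against a baseline. The function names and all numeric values are hypothetical and are not taken from the paper's evaluation.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-Delay Product: energy consumed multiplied by execution time."""
    return energy_joules * delay_seconds


def improvement(baseline: float, optimized: float) -> float:
    """Relative reduction with respect to the baseline, as a fraction."""
    return (baseline - optimized) / baseline


# Hypothetical measurements for a single run (assumed values, not paper data).
base_energy, base_time = 100.0, 10.0  # joules, seconds
new_energy, new_time = 90.0, 8.0      # joules, seconds

time_gain = improvement(base_time, new_time)
edp_gain = improvement(edp(base_energy, base_time), edp(new_energy, new_time))

print(f"execution time improvement: {time_gain:.1%}")
print(f"EDP improvement: {edp_gain:.1%}")
```

Because EDP multiplies energy by delay, a run that finishes sooner improves EDP even when its energy barely changes, which is why EDP gains tend to exceed execution-time gains when both move in the same direction.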
