TCU: A Multi-Objective Hardware Thread Mapping Unit for HPC Clusters

Meeting multiple, partially orthogonal optimization targets during thread scheduling on HPC and manycore platforms simultaneously, like maximizing CPU performance, meeting deadlines of time critical tasks, minimizing power and securing thermal resilience, is a major challenge because of associated scalability and thread management overhead. We tackle these challenges by introducing the Thread Control Unit (TCU), a configurable, low-latency, low-overhead hardware thread mapper in compute nodes of an HPC cluster. The TCU takes various sensor information into account and can map threads to 4–16 CPUs of a compute node within a small and bounded number of clock cycles in round-robin, single- or multi-objective manner. The TCU design can consider not just load balancing or performance criteria but also physical constraints like temperature limits, power budgets and reliability aspects. Evaluations of different mapping policies show that multi-objective thread mapping provides about 10 to 40 % less mapping latency for periodic workloads compared to single-objective or round-robin policies. For bursty workloads under high load conditions, a 20 % reduction is achieved.

[1]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[2]  Claes Wikström,et al.  Concurrent programming in ERLANG (2nd ed.) , 1996 .

[3]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[4]  Tajana Simunic,et al.  Static and Dynamic Temperature-Aware Scheduling for Multiprocessor SoCs , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[5]  Kevin Skadron,et al.  Performance, energy, and thermal considerations for SMT and CMP architectures , 2005, 11th International Symposium on High-Performance Computer Architecture.

[6]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[7]  Karthikeyan Sankaralingam,et al.  Dark silicon and the end of multicore scaling , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[8]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[9]  Christopher J. Hughes,et al.  Carbon: architectural support for fine-grained parallelism on chip multiprocessors , 2007, ISCA '07.

[10]  Jörg Henkel,et al.  Invasive manycore architectures , 2012, 17th Asia and South Pacific Design Automation Conference.

[11]  John Kubiatowicz,et al.  Tessellation: Refactoring the OS around explicit resource containers with continuous adaptation , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[12]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[13]  Andreas Herkersdorf,et al.  Hardware assisted thread assignment for RISC based MPSoCs in invasive computing , 2011, 2011 International Symposium on Integrated Circuits.