Thread shuffling: Combining DVFS and thread migration to reduce energy consumptions for multi-core systems

In recent years, multi-core systems have become mainstream in computer industry. The design of multi-cores takes advantage of thread-level parallelism in emerging applications that are computationally intensive and highly parallel. Energy efficiency is one of the biggest challenges in the design of multi-core systems, and workload imbalance among parallel threads is one of sources of energy inefficiency. Many techniques based on dynamic voltage frequency scaling (DVFS) are proposed to save energy consumptions on multi-cores, but all of them assume that each core in a multi-core system contains only one hardware context and only one thread can execute on one core at a time. However, mainstream multi-core systems are moving to have simultaneous multithreading (SMT) support in cores, and existing DVFS-based techniques are not effective to achieve maximum energy savings. In this paper, we present a novel technique called thread shuffling, which combines thread migration and DVFS to achieve maximum energy savings and maintain performance on a multi-core system supporting SMT. Thread shuffling is implemented and simulated in a cycle-accurate ×86 multi-core system. The experiments show that it achieves up to 56% energy savings without performance penalty for selected Recognition, Mining and Synthesis (RMS) applications from Intel Labs.

[1]  Charles E. Molnar,et al.  Anomalous Behavior of Synchronizer and Arbiter Circuits , 1973, IEEE Transactions on Computers.

[2]  Richard Uhlig,et al.  SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture , 1999 .

[3]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[4]  Alan Jay Smith,et al.  Improving dynamic voltage scaling algorithms with PACE , 2001, SIGMETRICS '01.

[5]  Steven M. Nowick,et al.  Robust interfaces for mixed-timing systems with application to latency-insensitive protocols , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[6]  Diana Marculescu,et al.  Power and performance evaluation of globally asynchronous locally synchronous processors , 2002, ISCA.

[7]  R. D. Valentine,et al.  The Intel Pentium M processor: Microarchitecture and performance , 2003 .

[8]  Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction , 2003, MICRO.

[9]  Michael L. Scott,et al.  Hiding synchronization delays in a GALS processor microarchitecture , 2004, 10th International Symposium on Asynchronous Circuits and Systems, 2004. Proceedings..

[10]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[11]  José González,et al.  Frontend frequency-voltage adaptation for optimal energy-delay/sup 2/ , 2004, IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings..

[12]  T. N. Vijaykumar,et al.  Heat-and-run: leveraging SMT and CMP to manage power density through the operating system , 2004, ASPLOS XI.

[13]  Lin Chao,et al.  Compute-Intensive, Highly Parallel Applications and Uses , 2005 .

[14]  Mahmut T. Kandemir,et al.  Exploiting barriers to optimize power consumption of CMPs , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[15]  Margaret Martonosi,et al.  A dynamic compilation framework for controlling microprocessor energy and performance , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[16]  Ravi Rajwar,et al.  The impact of performance asymmetry in emerging multicore architectures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[17]  Coniferous softwood GENERAL TERMS , 2003 .

[18]  Shekhar Borkar Thousand Core ChipsA Technology Perspective , 2007, DAC 2007.

[19]  José González,et al.  Meeting points: Using thread criticality to adapt multicore hardware to parallel regions , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[20]  Stefan Rusu,et al.  A 45nm 8-core enterprise Xeon ® processor , 2009 .

[21]  Rajesh Kumar,et al.  A family of 45nm IA processors , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[22]  Margaret Martonosi,et al.  Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors , 2009, ISCA '09.

[23]  Balaram Sinharoy,et al.  POWER7: IBM's next generation server processor , 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).

[24]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[25]  Jonathan Chang,et al.  A 45 nm 8-Core Enterprise Xeon¯ Processor , 2010, IEEE J. Solid State Circuits.