Multiversioned decoupled access-execute: the key to energy-efficient compilation of general-purpose programs

Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management become severely limited, and thus, compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well-matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method is proved efficient for statically analyzable codes, general-purpose applications pose significant challenges due to pointer aliasing, complex control flow and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique achieves over 20% on average energy-delay-product (EDP) improvements (energy over 15% and performance over 5%) across 14 benchmarks from SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.

[1]  Sharad Malik,et al.  Compile-time dynamic voltage scaling settings: opportunities and limits , 2003, PLDI '03.

[2]  A. Jaleel Memory Characterization of Workloads Using Instrumentation-Driven Simulation A Pin-based Memory Characterization of the SPEC CPU 2000 and SPEC CPU 2006 Benchmark Suites , 2022 .

[3]  Stefanos Kaxiras,et al.  Green governors: A framework for Continuously Adaptive DVFS , 2011, 2011 International Green Computing Conference and Workshops.

[4]  Margaret Martonosi,et al.  Computer Architecture Techniques for Power-Efficiency , 2008, Computer Architecture Techniques for Power-Efficiency.

[5]  Tomofumi Yuki,et al.  Folklore Confirmed: Compiling for Speed = Compiling for Energy , 2013, LCPC.

[6]  Joel H. Saltz,et al.  Run-time parallelization and scheduling of loops , 1989, SPAA '89.

[7]  Vincent Loechner,et al.  Dynamic and Speculative Polyhedral Parallelization Using Compiler-Generated Skeletons , 2013, International Journal of Parallel Programming.

[8]  Josep Torrellas,et al.  An efficient algorithm for the run-time parallelization of DOACROSS loops , 1994, Proceedings of Supercomputing '94.

[9]  David Black-Schaffer,et al.  Towards more efficient execution: a decoupled access-execute approach , 2013, ICS '13.

[10]  Xipeng Shen,et al.  Space-efficient multi-versioning for input-adaptive feedback-driven program optimizations , 2014, OOPSLA.

[11]  MirchandaneyRavi,et al.  Run-Time Parallelization and Scheduling of Loops , 1991 .

[12]  Stijn Eyerman,et al.  A Counter Architecture for Online DVFS Profitability Estimation , 2010, IEEE Transactions on Computers.

[13]  George Ho,et al.  PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[14]  Barbara G. Ryder,et al.  Blended analysis for performance understanding of framework-based applications , 2007, ISSTA '07.

[15]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[16]  R.H. Dennard,et al.  Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.

[17]  Martin Hirzel,et al.  Dynamic hot data stream prefetching for general-purpose programs , 2002, PLDI '02.

[18]  K. Steinhubl Design of Ion-Implanted MOSFET'S with Very Small Physical Dimensions , 1974 .

[19]  Yale N. Patt,et al.  Predicting Performance Impact of DVFS for Realistic Memory Systems , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[20]  David Black-Schaffer,et al.  Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling , 2014, CGO '14.

[21]  Dean M. Tullsen,et al.  Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices , 2005, PLDI '05.

[22]  Shigeru Chiba,et al.  A New Optimization Technique for the Inspector-Executor Method , 2002, IASTED PDCS.

[23]  Mark Horowitz,et al.  Energy dissipation in general purpose microprocessors , 1996, IEEE J. Solid State Circuits.

[24]  Martin Hirzel,et al.  Bursty Tracing: A Framework for Low-Overhead Temporal Profiling , 2001 .

[25]  Grigori Fursin,et al.  Finding representative sets of optimizations for adaptive multiversioning applications , 2009, ArXiv.

[26]  Matthias Hauswirth,et al.  Low-overhead memory leak detection using adaptive statistical profiling , 2004, ASPLOS XI.

[27]  Lingjia Tang,et al.  Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[28]  Vincent Loechner,et al.  VMAD: A virtual machine for advanced dynamic analysis of programs , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[29]  Dean M. Tullsen,et al.  Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.

[30]  Stéphan Jourdan,et al.  Haswell: The Fourth-Generation Intel Core Processor , 2014, IEEE Micro.

[31]  Matthew Arnold,et al.  A framework for reducing the cost of instrumented code , 2001, PLDI '01.

[32]  Juan Touriño,et al.  An Inspector-Executor Algorithm for Irregular Assignment Parallelization , 2004, ISPA.

[33]  P. Sadayappan,et al.  A Compiler Analysis to Determine Useful Cache Size for Energy Efficiency , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[34]  T. K. Prakash,et al.  Performance Characterization of SPEC CPU 2006 Benchmarks on Intel Core 2 Duo Processor , .

[35]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[36]  Nancy M. Amato,et al.  Run-time methods for parallelizing partially parallel loops , 1995, ICS '95.

[37]  Xipeng Shen,et al.  An input-centric paradigm for program dynamic optimizations , 2010, OOPSLA.

[38]  Vincent Loechner,et al.  VMAD: An Advanced Dynamic Program Analysis and Instrumentation Framework , 2012, CC.

[39]  James E. Smith,et al.  Decoupled access/execute computer architectures , 1984, TOCS.

[40]  Satish Narayanasamy,et al.  LiteRace: effective sampling for lightweight data-race detection , 2009, PLDI '09.

[41]  Stefanos Kaxiras,et al.  Interval-based models for run-time DVFS orchestration in superscalar processors , 2010, CF '10.

[42]  Weifeng Zhang,et al.  Accelerating and Adapting Precomputation Threads for Effcient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[43]  Xuan Chen,et al.  Adaptive Multi-versioning for OpenMP Parallelization via Machine Learning , 2009, 2009 15th International Conference on Parallel and Distributed Systems.