Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach

Because of tight power and energy constraints, industry is progressively shifting toward heterogeneous system-on-chip (SoC) architectures that combine general-purpose cores with a number of accelerators. However, such SoC architectures can be very challenging for the vast majority of programmers to program efficiently, because of the numerous programming approaches and languages involved. Libraries, on the other hand, provide a simple way to let programmers take advantage of complex architectures without having to learn new accelerator-specific or domain-specific languages. Increasingly, library-based (also called algorithm-centric) programming approaches propose to generalize the usage of libraries and to compose programs around them, instead of using libraries as mere complements. In this article, we present a software framework for achieving performance portability by leveraging a generalized library-based approach. Inspired by the notion of a component, as employed in software engineering and HW/SW codesign, we advocate that nonexpert programmers write simple wrapper code around existing libraries to provide simple but necessary semantic information to the runtime. To achieve performance portability, the runtime employs machine learning (simulated annealing) to select the most appropriate accelerator and its parameters for a given algorithm. This selection factors in the possibly complex composition of algorithms used in the application, the communication among the various accelerators, and the tradeoff between different objectives (i.e., accuracy, performance, and energy). Using a set of benchmarks run on a real heterogeneous SoC composed of a multicore processor and a GPU, we show that the runtime overhead is fairly small at 5.1% for the GPU and 6.4% for the multicore. We then apply our accelerator selection approach to a simulated SoC platform containing multiple inexact accelerators. We show that accelerator selection together with hardware parameter tuning achieves an average 46.2% energy reduction and a speedup of 2.1× while meeting the desired application error target.
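
The selection step described in the abstract can be illustrated with a minimal simulated-annealing sketch. This is not the authors' runtime: the algorithm names, the device and knob options, the cost numbers, the transfer penalty, and the way time and energy are summed into one cost are all assumptions made up for illustration. The sketch only shows the general idea of searching over per-algorithm accelerator and parameter choices while penalizing device-to-device communication and violations of an error target.

```python
import math
import random

# Hypothetical options per algorithm: (device, knob) -> (time, energy, error).
# Every number here is invented for illustration only.
OPTIONS = {
    "filter": {("cpu", 0): (4.0, 8.0, 0.00), ("gpu", 0): (1.5, 6.0, 0.00),
               ("inexact", 1): (1.0, 2.0, 0.02), ("inexact", 2): (0.8, 1.2, 0.08)},
    "fft":    {("cpu", 0): (6.0, 12.0, 0.00), ("gpu", 0): (2.0, 9.0, 0.00),
               ("inexact", 1): (1.8, 3.0, 0.03), ("inexact", 2): (1.2, 2.0, 0.10)},
    "sort":   {("cpu", 0): (3.0, 5.0, 0.00), ("gpu", 0): (2.5, 7.0, 0.00)},
}
PIPELINE = ["filter", "fft", "sort"]  # assumed composition: run in this order
TRANSFER_TIME = 0.5                   # assumed cost of moving data between devices
ERROR_PENALTY = 1000.0                # large penalty for missing the error target


def evaluate(mapping):
    """Return (time, energy, error) for a full algorithm -> (device, knob) mapping."""
    time = energy = error = 0.0
    for i, algo in enumerate(PIPELINE):
        t, e, err = OPTIONS[algo][mapping[algo]]
        time += t
        energy += e
        error += err
        # Charge a transfer whenever consecutive algorithms use different devices.
        if i > 0 and mapping[algo][0] != mapping[PIPELINE[i - 1]][0]:
            time += TRANSFER_TIME
    return time, energy, error


def cost(mapping, error_target):
    """Scalar objective: time + energy, plus a penalty if the error target is missed."""
    time, energy, error = evaluate(mapping)
    penalty = ERROR_PENALTY if error > error_target else 0.0
    return time + energy + penalty


def anneal(error_target, steps=2000, t_start=5.0, cooling=0.995, seed=0):
    """Simulated annealing over mappings, starting from the all-CPU mapping."""
    rng = random.Random(seed)
    current = {algo: ("cpu", 0) for algo in PIPELINE}
    current_cost = cost(current, error_target)
    best, best_cost = dict(current), current_cost
    temperature = t_start
    for _ in range(steps):
        # Neighbor move: re-pick the option of one randomly chosen algorithm.
        algo = rng.choice(PIPELINE)
        candidate = dict(current)
        candidate[algo] = rng.choice(list(OPTIONS[algo]))
        delta = cost(candidate, error_target) - current_cost
        # Always accept improvements; accept regressions with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    mapping, total = anneal(error_target=0.06)
    print(mapping, total, evaluate(mapping))
```

Because the search starts from the exact all-CPU mapping and tracks the best mapping seen, the returned mapping never costs more than that baseline under this objective. A real runtime would presumably obtain its costs from measurement or models rather than a fixed table.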
