Energy Optimization by Software Prefetching for Task Granularity in GPU-Based Embedded Systems

Energy saving and optimization play an increasingly important role in industrial electronic systems. A heterogeneous embedded system combines a general-purpose central processing unit (CPU) with graphics processing units (GPUs) as an acceleration module. This article explores task granularity and software prefetching as strategies for energy optimization. We propose a novel energy optimization model for GPU-based embedded systems that exploits the spatial and temporal relations of a communication-based pipeline. We analyze the characteristics of multithreaded execution on parallel GPUs and present an effective algorithm for dynamic power optimization that adaptively adjusts the software prefetching distance. Experimental results show that dynamic energy consumption is reduced by 22.1% and 21.8% under the register and shared-memory prefetching strategies, respectively, without loss of performance. These results demonstrate the effectiveness of the proposed methods for reducing energy consumption in performance-driven computing in industrial scenarios.
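To make the notion of a software prefetching distance concrete, the following is a minimal CPU-side sketch and not the algorithm proposed in the article. It assumes a GCC- or Clang-compatible compiler for `__builtin_prefetch`; the function name `sum_with_prefetch` and the fixed `distance` parameter are hypothetical. The article adjusts the distance adaptively and targets GPU registers and shared memory, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the article): sums `data` while issuing a
// prefetch hint for the element `distance` positions ahead of the current one.
// `distance` is a fixed tuning parameter here; an adaptive scheme would
// change it at run time.
double sum_with_prefetch(const std::vector<double>& data, std::size_t distance) {
    double total = 0.0;
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Only hint addresses that lie inside the array.
        if (i + distance < n) {
#if defined(__GNUC__) || defined(__clang__)
            // Arguments: address, 0 = read access, 1 = low temporal locality.
            __builtin_prefetch(&data[i + distance], 0, 1);
#endif
        }
        total += data[i];  // The computation itself is unchanged by the hint.
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same for any `distance`. What changes is how much memory latency overlaps with computation, which is the quantity a prefetch-distance policy would tune.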
