Investigation on the power efficiency of multi-core and GPU Processing Element in large scale SIMD computation with CUDA

The CPU-GPU Processing Element (PE) has become a popular building block for modern multiprocessing systems because of its high performance in massively parallel processing and vector computation. Power dissipation is an important factor in the design of High Performance Computing (HPC) systems: a large-scale scientific computation may use thousands of processors and run continuously for hundreds of hours, consuming an enormous amount of energy. Improving the utilization of each individual PE so that it reaches its best computational capability and power efficiency therefore helps reduce the overall power cost of large multiprocessing systems. The power performance of a CUDA PE depends on the electrical characteristics of its hardware components and their interconnections, as well as on the high-level applications and parallel algorithms executed on it. Based on measurements and experimental evaluation, this work provides a load-sharing method that adjusts the workload assignment between the CPU and GPU components inside a CUDA PE in order to optimize overall power efficiency. The improvements in computation time and power consumption have been validated by examining program executions with this method applied on real systems.
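The load-sharing idea can be illustrated with a toy energy model. The sketch below is not the method evaluated in this work; it assumes a simple model in which the CPU and GPU process their shares concurrently, each draws a constant active power while busy and a constant idle power while waiting for the other, and the best GPU share is found by a grid search. All names (`energy_and_time`, `best_gpu_share`, `EXAMPLE_PARAMS`) and all numeric values are hypothetical and chosen only for illustration.

```python
def energy_and_time(f, work, gpu_rate, cpu_rate,
                    gpu_active_w, gpu_idle_w, cpu_active_w, cpu_idle_w):
    """Return (energy in joules, wall time in seconds) for GPU share f in [0, 1].

    Assumed model: both devices start together; the device that finishes
    first idles until the other one is done.
    """
    t_gpu = f * work / gpu_rate          # seconds the GPU is busy
    t_cpu = (1.0 - f) * work / cpu_rate  # seconds the CPU is busy
    total = max(t_gpu, t_cpu)            # wall time of the whole job
    energy = (gpu_active_w * t_gpu + gpu_idle_w * (total - t_gpu)
              + cpu_active_w * t_cpu + cpu_idle_w * (total - t_cpu))
    return energy, total


def best_gpu_share(steps=100, **params):
    """Grid-search the GPU share that minimizes energy under the model above."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda f: energy_and_time(f, **params)[0])


# Hypothetical device parameters (work units, units/second, watts).
EXAMPLE_PARAMS = dict(
    work=1000.0,
    gpu_rate=90.0, cpu_rate=10.0,
    gpu_active_w=150.0, gpu_idle_w=30.0,
    cpu_active_w=40.0, cpu_idle_w=30.0,
)

if __name__ == "__main__":
    share = best_gpu_share(**EXAMPLE_PARAMS)
    energy, seconds = energy_and_time(share, **EXAMPLE_PARAMS)
    print(f"GPU share {share:.2f}: {energy:.1f} J in {seconds:.2f} s")
```

Under this model the energy-optimal split is not always the time-balanced one: if the CPU's active power is much higher than its idle power, assigning all work to the GPU can consume less energy even though it takes longer. On a real system the rates and power figures would have to come from measurement.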
