Energy efficient tiling on a Many-Core Architecture

Energy efficiency and power consumption have become critical constraints in computer architecture. The multi-core and many-core era has been driven by the growing demand for high-performance computation within a feasible power budget. How to model the energy consumption of many-core architectures, in order to propose techniques for designing energy-efficient applications, is a topic of high interest in the community. In this paper, we develop an energy consumption model for many-core architectures with a software-managed memory hierarchy, and we propose a general methodology for designing tiling techniques for energy-efficient applications. The model and the methodology have the following characteristics: (1) The energy consumption model depends on the number and type of instructions executed and on the total execution time of the application. (2) The model scales with the number of hardware thread units and accounts for stalls caused by data dependencies or by arbitration of shared resources. (3) The methodology is based on an optimization problem whose solution gives the tiling and the tile traversal order that minimize the energy consumed, parametrized by the size of each level of the memory hierarchy. (4) We show two different techniques for solving the optimization problem, for two different applications: Matrix Multiplication (MM) and Finite Difference Time Domain (FDTD). Our experimental evaluation on a real IBM Cyclops-64 (C64) chip confirms the accuracy of our energy consumption model and shows that the proposed techniques reduce total energy consumption and also increase power efficiency.
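
The abstract does not state the functional form of the model, so the following is only a minimal sketch of characteristic (1), assuming a linear form: a per-instruction energy cost summed over instruction types, plus a static power term multiplied by execution time. The class name `EnergyModel`, its fields, and any coefficient values passed to it are hypothetical and are not C64 measurements. Under this assumption, stalls would enter the model only through a longer execution time.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EnergyModel:
    """Hypothetical linear energy model (an assumption, not the paper's formula)."""

    # Assumed energy cost in joules per executed instruction, keyed by type.
    energy_per_instr: Dict[str, float]
    # Assumed constant power in watts drawn for the whole execution time.
    static_power: float

    def total_energy(self, instr_counts: Dict[str, int], exec_time: float) -> float:
        """Return estimated energy in joules for the given instruction mix.

        instr_counts maps instruction type to the number executed;
        exec_time is the total execution time in seconds.
        """
        dynamic = sum(
            self.energy_per_instr[kind] * count
            for kind, count in instr_counts.items()
        )
        return dynamic + self.static_power * exec_time
```

With such a model, two candidate tilings could be compared by feeding in their respective instruction counts and execution times; a tiling that trades extra arithmetic for fewer off-chip loads would come out ahead only if the loads it saves cost more energy than the instructions and time it adds.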
