Energy aware synthesis of application kernels through composition of data-paths on a CGRA

Transistor supply voltages no longer scale at the same rate as transistor density and frequency of operation. This has led to the Dark Silicon problem, wherein only a fraction of transistors can operate at maximum frequency and nominal voltage if the chip is to function within its power and thermal budgets. Heterogeneous computing systems, which consist of General Purpose Processors (GPPs), Graphics Processing Units (GPUs) and application specific accelerators, can provide improved performance while keeping power dissipation at a realistic level. For the accelerators to be effective, they have to be specialized for related classes of application kernels and have to be synthesized from high level specifications. Coarse Grained Reconfigurable Arrays (CGRAs) have been proposed as accelerators for a variety of application kernels. For CGRAs to be used as accelerators in the Dark Silicon era, a synthesis framework that focuses on optimizing energy efficiency while achieving the target performance is essential. However, existing compilation techniques for CGRAs focus on optimizing only for performance, and any reduction in energy is just a side-effect. In this paper we explore synthesizing application kernels, expressed as functions, on a coarse grained composable reconfigurable array (CGCRA). The proposed reconfigurable array comprises HyperCells, which are reconfigurable macro-cells that facilitate modeling power and performance in terms of easily measurable parameters. The proposed synthesis approach takes kernels expressed in a functional language, applies a sequence of well known program transformations, explores trade-offs between throughput and energy using the power and performance models, and realizes the kernels on the CGCRA. This approach, when used to map a set of signal processing and linear algebra kernels, achieves resource utilization varying from 50% to 80%.
