Improving Energy Efficiency of Coarse-Grain Reconfigurable Arrays Through Modulo Schedule Compression/Decompression

Modulo-scheduled coarse-grain reconfigurable array (CGRA) processors excel at exploiting loop-level parallelism at a high performance-per-watt ratio. The frequent reconfiguration of the array, however, causes between 25% and 45% of the consumed chip energy to be spent on the instruction memory and on fetches from it. This article presents a hardware/software codesign methodology for such architectures that reduces both the memory required to store the modulo-scheduled loops and the energy consumed by the instruction decode logic. The hardware modifications improve the spatial organization of a CGRA's execution plan by reorganizing the configuration memory into separate partitions based on a statistical analysis of code. A compiler technique optimizes the generated code in the temporal dimension by minimizing the number of signal changes. Across a wide variety of application domains, the optimizations reduce code size by more than 63% on average and the energy consumed by the instruction decode logic by 70% on average. The compressed loops can be decompressed in hardware with no additional latency, which makes the presented method well suited to low-power CGRAs running at high frequencies. The presented technique is orthogonal to dictionary-based compression schemes and can be combined with them to achieve a further reduction in code size.
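
The two ideas in the abstract, partitioning the configuration by field and minimizing signal changes over time, can be illustrated with a small sketch. It is not the article's actual encoding: the field layout, the use of `None` for an unused (don't-care) field, and the helper names `fill_dont_cares`, `compress_partition`, and `decompress_partition` are all assumptions made for illustration. Each field of the configuration word is treated as its own partition, don't-care slots are filled so that the field keeps its previous value, and a partition then stores a value only in the cycles where that value changes.

```python
from typing import List, Optional, Tuple


def fill_dont_cares(column: List[Optional[int]]) -> List[int]:
    """Replace don't-care slots (None) so the field changes as rarely as possible.

    Hypothetical helper: each don't-care repeats the last defined value;
    leading don't-cares take the first defined value (0 if none exists).
    """
    defined = [v for v in column if v is not None]
    current = defined[0] if defined else 0
    filled = []
    for v in column:
        if v is not None:
            current = v
        filled.append(current)
    return filled


def compress_partition(column: List[int]) -> List[Tuple[int, int]]:
    """Keep a (cycle, value) entry only where the value differs from the previous cycle."""
    entries: List[Tuple[int, int]] = []
    for cycle, value in enumerate(column):
        if not entries or entries[-1][1] != value:
            entries.append((cycle, value))
    return entries


def decompress_partition(entries: List[Tuple[int, int]], length: int) -> List[int]:
    """Rebuild the per-cycle values by holding each entry until the next one."""
    out: List[int] = []
    idx = 0
    current = 0
    for cycle in range(length):
        if idx < len(entries) and entries[idx][0] == cycle:
            current = entries[idx][1]
            idx += 1
        out.append(current)
    return out


# Assumed toy schedule: 4 cycles, 3 fields per configuration word
# (opcode, source mux select, constant). None marks an unused field.
schedule = [
    (1, None, 7),
    (None, 2, None),
    (1, None, None),
    (3, 2, None),
]

# One column per field, i.e. one partition per field in this sketch.
columns = [fill_dont_cares(list(col)) for col in zip(*schedule)]
compressed = [compress_partition(col) for col in columns]

stored = sum(len(entries) for entries in compressed)
total = len(schedule) * len(schedule[0])
print(f"stored {stored} of {total} field slots")
```

In this toy schedule only the opcode field changes after the first cycle, so the per-field partitions hold far fewer entries than the uncompressed schedule has slots. The sketch ignores the wrap-around from the last cycle of a modulo schedule back to the first, and it says nothing about how real hardware would index the partitions.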
