Paper information - Improving Energy Efficiency of Coarse-Grain Reconfigurable Arrays Through Modulo Schedule Compression/Decompression

Modulo-scheduled coarse-grain reconfigurable array (CGRA) processors excel at exploiting loop-level parallelism at a high performance-per-watt ratio. The frequent reconfiguration of the array, however, causes between 25% and 45% of the consumed chip energy to be spent on the instruction memory and on fetches from it. This article presents a hardware/software codesign methodology for such architectures that reduces both the storage required for the modulo-scheduled loops and the energy consumed by the instruction decode logic. The hardware modifications improve the spatial organization of a CGRA’s execution plan by reorganizing the configuration memory into separate partitions based on a statistical analysis of code. A compiler technique optimizes the generated code in the temporal dimension by minimizing the number of signal changes. On average, the optimizations reduce code size by more than 63% and the energy consumed by the instruction decode logic by 70% across a wide variety of application domains. Decompression of the compressed loops can be performed in hardware with no additional latency, which makes the presented method well suited to low-power CGRAs running at high frequencies. The presented technique is orthogonal to dictionary-based compression schemes and can be combined with them to achieve a further reduction in code size.
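As a rough illustration of the temporal idea mentioned in the abstract (minimizing the number of signal changes), the Python sketch below counts how often one configuration field changes value over a repeating schedule and compares two ways of filling cycles in which the field is unused. This is not the authors' algorithm: the single-field model, the use of `None` as a don't-care marker, and the names `count_changes`, `fill_with_zero`, and `fill_hold_previous` are all assumptions made for illustration.

```python
from typing import List, Optional

# One configuration field over the cycles of a loop schedule.
# None marks a cycle in which the field is unused (a "don't care").
# This representation is a hypothetical model, not taken from the paper.
Column = List[Optional[int]]


def count_changes(column: List[int]) -> int:
    """Count cycle-to-cycle value changes, including the wrap-around from
    the last cycle back to the first, since the schedule repeats."""
    n = len(column)
    return sum(column[i] != column[(i + 1) % n] for i in range(n))


def fill_with_zero(column: Column) -> List[int]:
    """Naive encoding: every unused cycle is written as 0."""
    return [0 if v is None else v for v in column]


def fill_hold_previous(column: Column) -> List[int]:
    """Fill each unused cycle with the most recent defined value
    (wrapping around), so unused cycles introduce no new changes."""
    defined = [v for v in column if v is not None]
    if not defined:
        return [0] * len(column)
    last = defined[-1]  # value carried in from the previous iteration
    filled = []
    for v in column:
        if v is not None:
            last = v
        filled.append(last)
    return filled


if __name__ == "__main__":
    field = [3, None, 3, None, 5, None]
    print(count_changes(fill_with_zero(field)))
    print(count_changes(fill_hold_previous(field)))
```

In this toy model, holding the previous value leaves the field unchanged across unused cycles, which is the kind of effect a compiler pass could exploit to lower switching activity in the decode logic; the actual technique and its hardware support are described in the article itself.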

Bernhard Egger | Hochan Lee | Mansureh S. Moghaddam | Dongkwan Suh
