Paper information - A sparse VLIW instruction encoding scheme compatible with generic binaries

A sparse VLIW instruction encoding scheme compatible with generic binaries

Very Long Instruction Word (VLIW) processors are commonplace in embedded systems due to their inherently low power consumption, as instruction scheduling is performed by the compiler instead of by the sophisticated and power-hungry hardware instruction schedulers used in their RISC counterparts. This is achieved by maximizing resource utilization by only targeting a certain application domain. However, when the inherent application ILP (instruction-level parallelism) is low, resources are under-utilized or wasted, and the encoding of NOPs results in large code sizes and consequently additional pressure on the memory subsystem to store these NOPs. To address the resource-utilization issue, we proposed a dynamic VLIW processor design that can merge unused resources to form additional cores to execute more threads. The resulting cores can have issue widths of 2, 4, and 8. Without sacrificing the possibility of code interruptibility and resumption, we proposed a generic binary scheme that allows a single binary to be executed on these different issue-width cores. However, the code size issue remains, as the generic binary scheme slightly increases the number of NOPs even further. Therefore, in this paper, we propose to apply a well-known stop-bit code compression technique to the generic binaries that, most importantly, maintains their code compatibility characteristic, allowing them to be executed on different cores. In addition, we present the hardware designs to support this technique in our dynamic core. For prototyping purposes, we implemented our design on a Xilinx Virtex-6 FPGA device and executed 14 embedded benchmarks. For comparison, we selected a non-dynamic (static) VLIW core that incorporates a similar stop-bit technique for its code compression.
We demonstrate that, while maintaining code compatibility on top of a flexible dynamic VLIW processor, the code size can be significantly reduced (up to 80%), resulting in energy savings, and that performance can be increased (up to a factor of three). Finally, our experimental results show that we can use smaller caches (2 to 4 times smaller), which will further help decrease energy consumption.
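
The abstract does not describe the encoding itself. As a rough illustration of the general stop-bit idea it refers to, the following Python sketch drops NOPs from fixed-width bundles and marks the last remaining operation of each bundle with a stop bit. The names (`NOP`, `compress`, `decompress`), the string representation of operations, and the assumption that any operation may occupy any issue slot are all hypothetical simplifications. The sketch does not model the paper's generic binary format or its hardware decoder.

```python
NOP = "nop"  # hypothetical placeholder for an empty issue slot


def compress(bundles):
    """Drop NOPs and tag the last kept operation of each bundle with a stop bit.

    Returns a flat list of (operation, stop_bit) pairs.
    """
    stream = []
    for bundle in bundles:
        # A bundle made only of NOPs still needs one entry to carry its stop bit.
        ops = [op for op in bundle if op != NOP] or [NOP]
        for i, op in enumerate(ops):
            stream.append((op, i == len(ops) - 1))
    return stream


def decompress(stream, issue_width):
    """Rebuild fixed-width bundles, padding with NOPs after each stop bit.

    Simplification: operations are packed into the lowest slots, so the
    original slot positions are not preserved.
    """
    bundles, current = [], []
    for op, stop in stream:
        current.append(op)
        if stop:
            if len(current) > issue_width:
                raise ValueError("bundle does not fit the issue width")
            bundles.append(current + [NOP] * (issue_width - len(current)))
            current = []
    if current:
        raise ValueError("stream ends without a stop bit")
    return bundles


if __name__ == "__main__":
    program = [
        ["add", NOP, NOP, NOP],
        ["ld", "mul", NOP, NOP],
        [NOP, NOP, NOP, NOP],
        ["st", NOP, "sub", NOP],
    ]
    packed = compress(program)
    print(len(packed), "stored operations instead of", 4 * len(program))
    print(decompress(packed, issue_width=4))
```

In this toy program, 16 issue slots shrink to 6 stored operations. That is the kind of NOP removal the abstract attributes to stop-bit compression, although the savings reported in the paper come from its own encoding and benchmarks, not from this sketch.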
