OpenCL code generation for low energy wide SIMD architectures with explicit datapath

Energy efficiency is one of the most important aspects of embedded processor design. A wide SIMD processor architecture is a promising way to build energy-efficient, high-performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that uses an explicit datapath to achieve high energy efficiency. To program the proposed architecture efficiently with a standard parallel programming language, we introduce a tool flow that compiles and maps OpenCL programs onto it. The compiler in this tool flow analyzes the static access patterns in OpenCL kernels and generates an efficient mapping and code that exploit the explicit datapath. Experimental results show that, with 128 processing elements (PEs), the proposed architecture achieves a speed-up of over 200 times and reduces the energy consumption of the register file and memory by over 90% compared to a RISC processor.
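As an illustration of what a static access pattern in a kernel might look like, the following is a minimal sketch in plain C. It is not taken from the paper: the filter, the function names `blur3_work_item` and `blur3_ndrange`, and the border clamping are all hypothetical, and the loop only emulates an OpenCL NDRange sequentially. The point is that every work-item reads its neighbours at fixed offsets (-1, 0, +1) from its own index, so the offsets are known at compile time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one OpenCL work-item; gid plays the role of
 * get_global_id(0). The neighbour offsets -1, 0, +1 are compile-time
 * constants, which is what makes the access pattern static. */
static int blur3_work_item(const int *in, size_t n, size_t gid)
{
    assert(n > 0 && gid < n);
    size_t left  = (gid == 0)     ? gid : gid - 1;  /* clamp at left border  */
    size_t right = (gid + 1 == n) ? gid : gid + 1;  /* clamp at right border */
    return (in[left] + 2 * in[gid] + in[right]) / 4;
}

/* Sequential emulation of the NDRange. On a wide SIMD machine each
 * iteration could conceptually be assigned to one PE (an assumption made
 * for illustration, not a description of the paper's mapping). */
static void blur3_ndrange(const int *in, int *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        out[gid] = blur3_work_item(in, n, gid);
}
```

Because the operands of each work-item sit at fixed distances from its own index, a compiler can in principle decide ahead of time where each operand lives relative to the consuming PE, rather than resolving addresses at run time.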
