OpenCL-based design methodology for application-specific processors

OpenCL is a programming language standard that lets the programmer express an application by structuring its computation as kernels. The OpenCL compiler is given explicit freedom to parallelize the execution of kernel instances at all levels of parallelism. Compared with the traditional C programming language, which is sequential in nature, OpenCL enables higher utilization of the parallelism naturally available in hardware while keeping a manageable learning curve for engineers familiar with C. This paper describes the methodology and compiler techniques involved in using OpenCL as the input language for a design flow of application-specific processors. At the core of the methodology is a whole-program optimizing compiler that links together the host and kernel code of the input OpenCL program and parallelizes the result on a customized, statically scheduled processor. The OpenCL vendor extension mechanism is used to provide clean access to custom operations. The methodology is studied with a design case to verify the scalability of the implementation at the instruction level and to exemplify the use of custom operations. The case shows that OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity.
