Enabling an OpenCL Compiler for Embedded Multicore DSP Systems

OpenCL is an industry effort to unify heterogeneous multicore programming. Its programming model defines SPMD kernels, vector types, and address space qualifiers, which let programmers exploit data parallelism through multicore processors and SIMD instructions, and data locality through the memory hierarchy. OpenCL has recently been adopted on many architectures, including multicore CPUs, GPUs, vector processors, embedded systems with application-specific processors, and even FPGAs. However, OpenCL support for embedded multicore DSP systems remains unaddressed. In this paper, we present our OpenCL support for such systems. Our target platform consists of one MPU and a DSP subsystem with multiple DSPs. The DSPs we address are VLIW processors with clustered functional units and distributed register files. To generate efficient code for such DSPs, a compiler must account for irregular register file access in many optimization phases. To utilize the DSPs with distributed register files, we propose a cluster-aware work-item dispatching scheme that vectorizes OpenCL kernels and assigns independent workloads to the clusters of a DSP. We also incorporate several optimizations to enable efficient DSP code generation. We evaluate the effectiveness of our OpenCL support with a set of OpenCL benchmark programs, running the experiments on a cycle-accurate DSP simulator and a multicore evaluation board. We report an average 29% performance improvement with our vectorization scheme and a nearly 2-fold speedup with two DSPs compared with a single-MPU setup.
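As a rough illustration of dispatching work-items across the clusters of a DSP, the following C sketch splits the work-items of one work-group into contiguous ranges, one per cluster, and walks each range in groups of adjacent work-items as a stand-in for vectorization. It is not the scheme from the paper: the cluster count, the vector width, the vector-add kernel body, and all function names are assumptions made for the example, and the clusters are modeled by a sequential loop, whereas on a clustered VLIW DSP their operations would be scheduled side by side.

```c
#include <stddef.h>

/* Hypothetical parameters, chosen only for this sketch. */
#define NUM_CLUSTERS 2  /* assumed number of clusters in one DSP */
#define VEC_WIDTH    4  /* assumed number of work-items merged per step */

/* Kernel body for a single work-item: c[gid] = a[gid] + b[gid]. */
static void vec_add_item(const int *a, const int *b, int *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Run the work-items [begin, end) that were assigned to one cluster. */
static void run_cluster(const int *a, const int *b, int *c,
                        size_t begin, size_t end)
{
    size_t i = begin;

    /* Main part: VEC_WIDTH adjacent work-items per iteration.
       A vectorizing compiler could turn the inner loop into SIMD code. */
    for (; i + VEC_WIDTH <= end; i += VEC_WIDTH) {
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
            vec_add_item(a, b, c, i + lane);
        }
    }

    /* Remainder: leftover work-items are handled one at a time. */
    for (; i < end; ++i) {
        vec_add_item(a, b, c, i);
    }
}

/* Split n work-items into one contiguous range per cluster. The ranges
   do not overlap, so each cluster works on an independent workload. */
void dispatch_work_group(const int *a, const int *b, int *c, size_t n)
{
    size_t chunk = (n + NUM_CLUSTERS - 1) / NUM_CLUSTERS; /* ceil(n / clusters) */

    for (size_t cluster = 0; cluster < NUM_CLUSTERS; ++cluster) {
        size_t begin = cluster * chunk;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        if (begin < end) {
            run_cluster(a, b, c, begin, end);
        }
    }
}
```

Calling `dispatch_work_group(a, b, c, n)` on three arrays of length `n` fills `c` with the element-wise sum. Giving each cluster a disjoint range keeps its operands separate from the other cluster's, which is the property one would want when register files are distributed; whether the paper partitions work-items this way is not stated in the abstract.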
