Unrolling and retiming of stream applications onto embedded multicore processors

In recent years, stream applications have become prevalent in many embedded domains. Stream programs differ from traditional sequential programs in their well-defined independent actors, explicit data communication, and stable code/data access patterns. To achieve high performance at low power, today's embedded multicore processors incorporate scratchpad memory (SPM). Programming SPM-based architectures is both challenging and time-consuming. In this paper, we address the problem of automatically compiling stream applications onto SPM-based embedded multicore processors through unrolling and retiming. In our technique, code overlay and data overlay are implemented to overcome the limited SPM capacity. Smart double buffering and code prefetching are introduced to amortize memory access delays. We evaluated the efficiency of our technique by compiling several stream applications onto the IBM Cell processor and comparing their performance with that of existing approaches.
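
As a rough illustration of why double buffering helps amortize memory access delays, the following minimal Python sketch shows the generic technique, not the paper's "smart" variant. The names `run_double_buffered`, `serial_time`, and `double_buffered_time`, as well as the uniform per-chunk fetch and compute costs, are assumptions introduced here for illustration only.

```python
def run_double_buffered(chunks, fetch, compute):
    """Process chunks using two buffers that alternate roles each iteration."""
    results = []
    if not chunks:
        return results
    buffers = [None, None]
    buffers[0] = fetch(chunks[0])  # prologue: fill the first buffer
    for i in range(len(chunks)):
        cur, nxt = i % 2, 1 - (i % 2)
        if i + 1 < len(chunks):
            # On real hardware this would be an asynchronous DMA transfer
            # issued before the computation starts; here it is a plain call.
            buffers[nxt] = fetch(chunks[i + 1])
        results.append(compute(buffers[cur]))
    return results


def serial_time(n, t_fetch, t_compute):
    """Total time when every fetch completes before its compute begins."""
    return n * (t_fetch + t_compute)


def double_buffered_time(n, t_fetch, t_compute):
    """Total time when fetching chunk i+1 overlaps computing on chunk i."""
    if n == 0:
        return 0
    # First fetch and last compute cannot overlap with anything else.
    return t_fetch + (n - 1) * max(t_fetch, t_compute) + t_compute


if __name__ == "__main__":
    print(run_double_buffered([1, 2, 3], lambda x: x * 10, lambda b: b + 1))
    print(serial_time(4, 3, 5), double_buffered_time(4, 3, 5))
```

Under this simplified cost model, the transfer cost of every chunk after the first is hidden whenever computation takes at least as long as the fetch. The price is a second buffer, which matters when SPM capacity is limited.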
