Compilation of stream programs onto scratchpad memory based embedded multicore processors through retiming

The prevalence of stream applications in the signal processing, multimedia, and network processing domains has given rise to a new trend in programming and architecture design. Several languages and multicore architectures have been developed to support streaming applications. In many of these multicore architectures, scratchpad memories (SPMs) have replaced caches because of their lower power consumption. Performance optimization on SPM-based architectures requires the programmer or compiler to manage the limited local memory efficiently. This paper addresses the problem of compiling stream programs onto multicore architectures that incorporate SPMs. We propose a retiming technique that maximizes throughput under a memory constraint for a user-specified number of software pipeline stages. The technique explores the trade-offs between double buffering and code overlay to achieve the best performance. We evaluated its efficiency by compiling several stream applications for the IBM Cell BE and comparing the results against existing approaches.
