StreamTMC: Stream compilation for tiled multi-core architectures

Tiled multi-core architectures have become an important kind of multi-core design for its good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains. It provides an attractive way to exploit the parallelism. However, the architecture characteristics of large amounts of cores, memory hierarchy and exposed communication between tiles have presented a performance challenge for stream programs running on tiled multi-cores. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for the tiled multi-core. This framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid of SPM and cache buffer allocation algorithm and data copy elimination mechanism is proposed to improve the efficiency of the data access. Last, a communication aware mapping is proposed to reduce the network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture and conduct an experimental study to verify the effectiveness. The experimental results indicate that StreamTMC can achieve an average of 58% improvement over the performance before optimization.

[1]  Trevor Mudge,et al.  MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS 2010.

[2]  Scott A. Mahlke,et al.  Sponge: portable stream programming on graphics engines , 2011, ASPLOS XVI.

[3]  Michael I. Gordon,et al.  Exploiting coarse-grained task, data, and pipeline parallelism in stream programs , 2006, ASPLOS XII.

[4]  Dongrui Fan,et al.  High Performance Matrix Multiplication on Many Cores , 2009, Euro-Par.

[5]  Guang R. Gao,et al.  Minimizing communication in rate-optimal software pipelining for stream programs , 2010, CGO '10.

[6]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[7]  Guang R. Gao,et al.  Experience on optimizing irregular computation for memory hierarchy in manycore architecture , 2008, ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming.

[8]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[9]  Scott A. Mahlke,et al.  Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[10]  David Kirk,et al.  NVIDIA cuda software and gpu parallel computing architecture , 2007, ISMM '07.

[11]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[12]  Edward A. Lee,et al.  Synthesis of Embedded Software from Synchronous Dataflow Specifications , 1999, J. VLSI Signal Process..

[13]  Abhishek Udupa,et al.  Software Pipelined Execution of Stream Programs on GPUs , 2009, 2009 International Symposium on Code Generation and Optimization.

[14]  Guang R. Gao,et al.  Software Pipelining for Stream Programs on Resource Constrained Multicore Architectures , 2012, IEEE Transactions on Parallel and Distributed Systems.

[15]  Edward A. Lee,et al.  Compile-Time Scheduling and Assignment of Data-Flow Program Graphs with Data-Dependent Iteration , 1991, IEEE Trans. Computers.

[16]  William Thies,et al.  Cache aware optimization of stream programs , 2005, LCTES.

[17]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[18]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[19]  Scott A. Mahlke,et al.  MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS XV.

[20]  William Thies,et al.  Phased scheduling of stream programs , 2003 .

[21]  Michael F. P. O'Boyle,et al.  Partitioning streaming parallelism for multi-cores: A machine learning based approach , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  Scott A. Mahlke,et al.  Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.

[23]  Henry Hoffmann,et al.  A stream compiler for communication-exposed architectures , 2002, ASPLOS X.

[24]  Edward A. Lee,et al.  A HIERARCHICAL MULTIPROCESSOR SCHEDULING FRAMEWORK FOR SYNCHRONOUS DATAFLOW GRAPHS , 1995 .

[25]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[26]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[27]  Hong Song,et al.  A Programming Model for an Embedded Media Processing Architecture , 2005, SAMOS.