Streamroller: A Unified Compilation and Synthesis System for Streaming Applications

The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a single handheld device has created demand for low-cost, low-power, high-performance implementations of these applications in the form of custom hardware. In more mainstream domains such as gaming consoles, the move toward greater realism in physics simulation and graphics has pushed the industry toward multicore systems. Many of the applications in these domains are streaming in nature. The key challenges are to derive efficient custom hardware from these applications and to map them efficiently onto multicore architectures. This dissertation presents a unified methodology, referred to as Streamroller, that applies both to the problem of scheduling stream programs onto multicore architectures and to the problem of automatically synthesizing custom hardware for stream applications. First, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, such as limited memory and explicit DMAs, are modeled in the scheduler. The scheduler is evaluated on a set of stream programs running on IBM's Cell processor. Second, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. Synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is its use of multifunction accelerators, which reduce cost by efficiently sharing hardware among multiple kernels. Finally, a method is presented that improves the integer linear program formulations used in the schedulers by exploiting symmetry in the solution space. Symmetry-breaking constraints are added to the formulation, and the resulting performance of the solver is evaluated.
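
To illustrate why symmetry in the solution space matters, the following is a minimal sketch and not the dissertation's formulation. It assumes a toy problem in which actors with hypothetical work estimates are assigned to identical cores so as to minimize the largest per-core load, and it replaces the integer linear program solver with brute-force enumeration. Because the cores are interchangeable, relabeling them yields equivalent assignments; the assumed symmetry-breaking rule `is_canonical` keeps only one representative per relabeling class. All names and numbers here are illustrative assumptions.

```python
from itertools import product


def max_load(assignment, work, num_cores):
    """Return the largest total work placed on any single core."""
    loads = [0] * num_cores
    for actor, core in enumerate(assignment):
        loads[core] += work[actor]
    return max(loads)


def is_canonical(assignment):
    """Assumed symmetry-breaking rule: an actor may use core k only if
    cores 0..k-1 are already used by earlier actors."""
    next_new = 0
    for core in assignment:
        if core > next_new:
            return False
        if core == next_new:
            next_new += 1
    return True


def search(work, num_cores, break_symmetry):
    """Enumerate assignments; return (best max load, assignments explored)."""
    best, explored = None, 0
    for assignment in product(range(num_cores), repeat=len(work)):
        if break_symmetry and not is_canonical(assignment):
            continue  # pruned: a relabeled copy of a canonical assignment
        explored += 1
        load = max_load(assignment, work, num_cores)
        if best is None or load < best:
            best = load
    return best, explored


work = [4, 3, 3, 2, 2]  # hypothetical per-actor work estimates
full = search(work, 3, break_symmetry=False)
reduced = search(work, 3, break_symmetry=True)
print(full, reduced)
```

In this toy setting both searches reach the same optimum, while the symmetry-broken search examines far fewer assignments. In an integer linear program, the analogous step is to add constraints that exclude the redundant relabelings before the solver runs.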
