Language and compiler support for stream programs

Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. Stream programs can be naturally represented as a graph of independent actors that communicate explicitly over data channels. In this work we focus on programs where the input and output rates of actors are known at compile time, enabling aggressive transformations by the compiler; this model is known as synchronous dataflow.

We develop a new programming language, StreamIt, that empowers both programmers and compiler writers to leverage the unique properties of the streaming domain. StreamIt offers several new abstractions, including hierarchical single-input single-output streams, composable primitives for data reordering, and a mechanism called teleport messaging that enables precise event handling in a distributed environment. We demonstrate the feasibility of developing applications in StreamIt via a detailed characterization of our 34,000-line benchmark suite, which spans from MPEG-2 encoding/decoding to GMTI radar processing. We also present a novel dynamic analysis for migrating legacy C programs into a streaming representation.

The central premise of stream programming is that it enables the compiler to perform powerful optimizations. We support this premise by presenting a suite of new transformations. We describe the first translation of stream programs into the compressed domain, enabling programs written for uncompressed data formats to automatically operate directly on compressed data formats (based on LZ77). This technique offers a median speedup of 15x on common video editing operations. We also review other optimizations developed in the StreamIt group, including automatic parallelization (offering an 11x mean speedup on the 16-core Raw machine), optimization of linear computations (offering a 5.5x average speedup on a Pentium 4), and cache-aware scheduling (offering a 3.5x mean speedup on a StrongARM 1100). While these transformations are beyond the reach of compilers for traditional languages such as C, they become tractable given the abundant parallelism and regular communication patterns exposed by the stream programming model.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
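To make the synchronous dataflow assumption in the abstract concrete, the sketch below shows one thing a compiler can do once every actor's input and output rates are fixed at compile time: solve the balance equations to find how many times each actor must fire per steady-state iteration. This is a generic textbook-style illustration in Python, not the StreamIt compiler's implementation; the function name `repetition_vector`, the edge tuple layout, and the actor names are all hypothetical, and the graph is assumed to be connected.

```python
from collections import defaultdict, deque
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(edges):
    """Return the smallest positive integer firing counts that balance every channel.

    edges: list of (producer, consumer, items_pushed, items_popped) tuples
    (a hypothetical layout chosen for this sketch). Assumes a connected graph
    and raises ValueError if the declared rates are inconsistent.
    """
    neighbors = defaultdict(list)
    for src, dst, push, pop in edges:
        # If src fires r times, dst must fire r * push / pop times.
        neighbors[src].append((dst, Fraction(push, pop)))
        neighbors[dst].append((src, Fraction(pop, push)))

    start = edges[0][0]
    rate = {start: Fraction(1)}
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        for other, ratio in neighbors[actor]:
            expected = rate[actor] * ratio
            if other not in rate:
                rate[other] = expected
                queue.append(other)
            elif rate[other] != expected:
                raise ValueError("inconsistent rates: no steady-state schedule exists")

    # Scale the fractional rates up to integers, then reduce to the smallest solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


# Hypothetical three-actor pipeline: source pushes 2 items per firing,
# downsample pops 3 and pushes 1, sink pops 2.
example = repetition_vector([
    ("source", "downsample", 2, 3),
    ("downsample", "sink", 1, 2),
])
# example == {"source": 3, "downsample": 2, "sink": 1}
```

In this example the source fires three times (6 items), the downsampler twice (consuming 6, producing 2), and the sink once (consuming 2), so every channel returns to its starting occupancy after one iteration. Having such a fixed steady state is what lets a compiler reason statically about scheduling and buffering.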
