Automated software synthesis for streaming applications on embedded manycore processors

Stream applications must process a virtually infinite sequence of data items. They appear in many areas, including communication, networking, multimedia, and cryptography. Embedded manycore systems, currently in the range of hundreds of cores, have shown tremendous potential for achieving high throughput and low power consumption for such applications. The focus of this dissertation is the automated synthesis of parallel software for stream applications on embedded manycore systems. Automated software synthesis significantly reduces development and debugging time. The vision is to enable seamless and efficient transformation from a higher-order specification of the stream application (e.g., a dataflow graph) to parallel software code (e.g., multiple .C files) for a given target manycore system. This automated process involves many steps that are being actively researched: workload estimation of the tasks (actors) in the dataflow graph, allocation of tasks to processors, scheduling of tasks for execution on the processors, binding of processors to physical cores on the chip, binding of communications to physical channels on the chip, generation of the parallel software code, backend code optimization, and estimation of throughput. This dissertation improves on the state of the art with the following contributions. First, a versatile task allocation algorithm for pipelined execution is proposed that is provably efficient and can be configured to target platforms with different underlying architectures. Second, a throughput estimation method is introduced that has acceptable accuracy, high scalability with respect to the number of cores, and a high degree of freedom in targeting platforms with different underlying on-chip networks. Third, a task scheduling algorithm based on iteration overlapping techniques is proposed, which explores the tradeoff between throughput and memory requirements for manycore platforms with and without FIFO-based on-chip communication channels. Finally, to increase the scalability of application throughput with respect to the number of cores, a malleable dataflow specification model is proposed.
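As a rough illustration of how task allocation and throughput estimation interact in pipelined execution, the following minimal sketch assigns actors to processors and estimates the resulting throughput. It is not the allocation or estimation method proposed in this dissertation. It assumes a generic longest-processing-time-first heuristic, ignores communication cost and the on-chip network entirely, and takes the steady-state throughput to be limited by the most heavily loaded processor. The actor names, workload values, and function names are all hypothetical.

```python
from typing import Dict


def greedy_allocate(workload: Dict[str, float], num_procs: int) -> Dict[str, int]:
    """Assign each actor to a processor using a longest-processing-time-first
    heuristic (an illustrative stand-in, not the dissertation's algorithm)."""
    loads = [0.0] * num_procs
    allocation: Dict[str, int] = {}
    # Visit actors from heaviest to lightest workload.
    for actor in sorted(workload, key=workload.get, reverse=True):
        proc = loads.index(min(loads))  # currently least-loaded processor
        allocation[actor] = proc
        loads[proc] += workload[actor]
    return allocation


def pipeline_throughput(workload: Dict[str, float], allocation: Dict[str, int]) -> float:
    """Estimate data items per time unit, assuming the most heavily loaded
    processor is the pipeline bottleneck and communication is free."""
    load: Dict[int, float] = {}
    for actor, proc in allocation.items():
        load[proc] = load.get(proc, 0.0) + workload[actor]
    return 1.0 / max(load.values())


# Hypothetical per-item workload estimates (time units) for four actors.
workload = {"source": 2.0, "filter": 5.0, "fft": 4.0, "sink": 1.0}
allocation = greedy_allocate(workload, num_procs=2)
throughput = pipeline_throughput(workload, allocation)
print(allocation, throughput)
```

With these hypothetical numbers, both processors end up with a load of 6 time units per item, so the estimated throughput is one item every 6 time units. A realistic flow would also account for communication between actors placed on different cores.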


[45]  Radu Marculescu,et al.  Energy- and performance-aware mapping for regular NoC architectures , 2005, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[46]  Koushik Sen,et al.  A randomized dynamic program analysis technique for detecting real deadlocks , 2009, PLDI '09.

[47]  Lokesh Sharma,et al.  A 32nm Westmere-EX Xeon® enterprise processor , 2011, 2011 IEEE International Solid-State Circuits Conference.

[48]  Soheil Ghiasi,et al.  Versatile Task Assignment for Heterogeneous Soft Dual-Processor Platforms , 2010, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[49]  Soheil Ghiasi,et al.  Joint throughput and energy optimization for pipelined execution of embedded streaming applications , 2007, LCTES '07.

[50]  Bart Kienhuis,et al.  Automatic partitioning and mapping of stream-based applications onto the Intel IXP Network processor , 2007, SCOPES '07.

[51]  Lothar Thiele,et al.  Performance analysis of distributed embedded systems , 2007, EMSOFT '07.

[52]  Jakob Engblom,et al.  The worst-case execution-time problem—overview of methods and survey of tools , 2008, TECS.

[53]  George Karypis,et al.  Architecture Aware Partitioning Algorithms , 2008, ICA3PP.

[54]  Edward A. Lee Building Unreliable Systems out of Reliable Components : The Real Time Story , 2005 .

[55]  Luca Benini,et al.  Dynamic frequency scaling with buffer insertion for mixed workloads , 2002, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[56]  Eorge,et al.  Unstructured Graph Partitioning and Sparse Matrix Ordering System Version 2 . 0 , 1995 .

[57]  Thomas A. Henzinger,et al.  The Embedded Systems Design Challenge , 2006, FM.

[58]  Edward A. Lee,et al.  Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing , 1989, IEEE Transactions on Computers.

[59]  Zhiyi Yu,et al.  A 167-Processor Computational Platform in 65 nm CMOS , 2009, IEEE Journal of Solid-State Circuits.

[60]  Alberto L. Sangiovanni-Vincentelli,et al.  Benefits and challenges for platform-based design , 2004, Proceedings. 41st Design Automation Conference, 2004..

[61]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[62]  William Thies,et al.  An empirical characterization of stream programs and its implications for language and compiler design , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[63]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[64]  Twan Basten,et al.  Reactive process networks , 2004, EMSOFT '04.

[65]  T. Mohsenin,et al.  A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling , 2008, 2008 IEEE Symposium on VLSI Circuits.

[66]  Soheil Ghiasi,et al.  System-Level Performance Estimation for Application-Specific MPSoC Interconnect Synthesis , 2008, 2008 Symposium on Application Specific Processors.

[67]  Yu Wang,et al.  An efficient technique for analysis of minimal buffer requirements of synchronous dataflow graphs with model checking , 2009, CODES+ISSS '09.

[68]  Rudy Lauwereins,et al.  Data memory minimisation for synchronous data flow graphs emulated on DSP-FPGA targets , 1997, DAC.

[69]  R. Passerone,et al.  System level design paradigms: Platform-based design and communication synthesis , 2004 .

[70]  Scott A. Mahlke,et al.  Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.

[71]  Alvise Bonivento,et al.  System level design paradigms: Platform-based design and communication synthesis , 2006, ACM Trans. Design Autom. Electr. Syst..

[72]  James R. Larus,et al.  Software and the Concurrency Revolution , 2005, ACM Queue.

[73]  Luciano Lavagno,et al.  Metropolis: An Integrated Electronic System Design Environment , 2003, Computer.

[74]  Scott A. Mahlke,et al.  Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[75]  Soonhoi Ha,et al.  Dynamic voltage scheduling with buffers in low-power multimedia applications , 2004, TECS.

[76]  Jason Cong,et al.  Synthesis of an application-specific soft multiprocessor system , 2007, FPGA '07.

[77]  Michael I. Gordon Compiler techniques for scalable performance of stream programs on multicore architectures , 2010 .

[78]  Rajeev Barua,et al.  Dynamic allocation for scratch-pad memory using compile-time decisions , 2006, TECS.

[79]  N. Ranganathan,et al.  A learning automata based framework for task assignment in heterogeneous computing systems , 1999, SAC '99.

[80]  Tinoosh Mohsenin,et al.  Multi-Split-Row Threshold decoding implementations for LDPC codes , 2009, 2009 IEEE International Symposium on Circuits and Systems.

[81]  Shuvra S. Bhattacharyya,et al.  Parameterized dataflow modeling of DSP systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[82]  Henry Hoffmann,et al.  A stream compiler for communication-exposed architectures , 2002, ASPLOS X.

[83]  Edward A. Lee,et al.  Software synthesis for DSP using ptolemy , 1995, J. VLSI Signal Process..

[84]  T. Mohsenin,et al.  An asynchronous array of simple processors for dsp applications , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[85]  Bevan M. Baas,et al.  Massively parallel processor array for mid-/back-end ultrasound signal processing , 2010, 2010 Biomedical Circuits and Systems Conference (BioCAS).

[86]  David F. Bacon,et al.  Compiler transformations for high-performance computing , 1994, CSUR.

[87]  Sander Stuijk,et al.  Multiprocessor Resource Allocation for Throughput-Constrained Synchronous Dataflow Graphs , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[88]  Andy D. Pimentel,et al.  Multiobjective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design , 2006, IEEE Transactions on Evolutionary Computation.

[89]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[90]  Doug Pulley Multi-core DSP for base stations: Large and small , 2008, 2008 Asia and South Pacific Design Automation Conference.

[91]  William J. Dally,et al.  Tradeoff between data-, instruction-, and thread-level parallelism in stream processors , 2007, ICS '07.

[92]  William J. Dally,et al.  Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures , 2010, SPAA '10.