Rapid Runtime Estimation Methods for Pipelined MPSoCs targeting Streaming Applications

The pipelined Multiprocessor System on Chip (MPSoC) paradigm is well suited to the data-flow nature of streaming applications, particularly multimedia applications. A pipelined MPSoC is a system in which processors are connected in a pipeline. To balance the pipeline for high throughput and a reduced area footprint, Application Specific Instruction set Processors (ASIPs) are used as the building blocks. Each ASIP in the system has a number of configurations, which differ in instruction set and cache size. The design space of a pipelined MPSoC consists of all possible combinations of these ASIP configurations. To estimate the runtime of a pipelined MPSoC for one combination of ASIP configurations, designers typically perform a cycle-accurate simulation of the whole pipelined MPSoC. Since the number of possible combinations (design points) can be on the order of billions, simulating every design point is impractical and estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC while minimizing the use of slow cycle-accurate simulations. The first method performs cycle-accurate simulations of individual ASIP configurations rather than of the whole system, and then uses an analytical model of the pipelined MPSoC to estimate its runtime. In the second method, the runtimes of individual ASIP configurations are themselves estimated using an analytical processor model; these estimates are then used in the pipelined MPSoC’s analytical model to estimate its runtime. Evaluating our approach on four benchmarks, we show that the maximum estimation error is 5.91% for the first method and 13.21% for the second, with average estimation errors of 2.28% and 5.91%, respectively. The time to cycle-accurately simulate the whole design space of a pipelined MPSoC is on the order of years, as design spaces with at least 10 design points are considered in this paper. In contrast, the cycle-accurate simulations of individual ASIP configurations (first method) take days, while simulating a subset of ASIP configurations and estimating their runtimes (second method) takes only several hours. Once these simulations are done, the runtime of each design point can be estimated simply by evaluating the estimation equation of the pipelined MPSoC’s analytical model.
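
To make the idea of composing per-ASIP runtimes into a system-level estimate concrete, the following is a minimal sketch. It does not reproduce the analytical model proposed in the paper, which is not stated in this abstract. Instead it assumes a textbook first-order pipeline approximation: each stage takes a fixed number of cycles per iteration, the pipeline first fills, and thereafter the slowest stage bounds throughput. The function names, the cycle counts, and the per-stage option lists are all hypothetical.

```python
from itertools import product
from math import prod


def design_space_size(configs_per_asip):
    """Number of design points: one configuration is chosen per ASIP."""
    return prod(configs_per_asip)


def estimate_pipeline_runtime(stage_cycles, iterations):
    """First-order estimate (an assumption, not the paper's model).

    stage_cycles: cycles one iteration spends in each pipeline stage.
    iterations:   number of data items streamed through the pipeline.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    fill = sum(stage_cycles)        # first item traverses every stage
    bottleneck = max(stage_cycles)  # slowest stage bounds steady-state throughput
    return fill + (iterations - 1) * bottleneck


def best_design_point(per_stage_options, iterations):
    """Exhaustively pick the combination with the lowest estimated runtime.

    per_stage_options: for each stage, the per-iteration cycle counts of its
    candidate ASIP configurations (obtained from per-ASIP simulation or from
    a processor model; the values used below are invented).
    """
    return min(
        product(*per_stage_options),
        key=lambda point: estimate_pipeline_runtime(point, iterations),
    )


if __name__ == "__main__":
    # Hypothetical three-stage pipeline with 2, 2, and 1 candidate configurations.
    options = [(100, 120), (250, 200), (180,)]
    print(design_space_size([len(o) for o in options]))
    print(best_design_point(options, iterations=1000))
```

Under this assumption, evaluating a design point costs a sum and a maximum over the stages, so the expensive simulations are needed only once per ASIP configuration and not once per design point.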
