Maximizing the Number of Good Dies for Streaming Applications in NoC-Based MPSoCs Under Process Variation

Scaling CMOS technology into nanometer feature-size nodes has made it practically impossible to precisely control the manufacturing process. This results in variation in the speed and power consumption of a circuit. To cope with process-induced variations, circuits are conventionally implemented with conservative design margins that guarantee the target frequency of each hardware component in manufactured multiprocessor chips. This approach, referred to as worst-case design, results in considerable circuit upsizing, which in turn reduces the number of dies per wafer. This work deals with the design of real-time systems for streaming applications (e.g., video decoders) constrained by a throughput requirement (e.g., frames per second) with reduced design margins, referred to as better-than-worst-case design. To this end, the first contribution of this work is a complete modeling framework that captures a streaming application mapped to an NoC-based multiprocessor system with voltage-frequency islands under process-induced die-to-die and within-die frequency variations. The framework is used to analyze, at the system level, the impact of variations in the frequency of hardware components on application throughput. The second contribution is a methodology that uses the proposed framework to estimate the impact of reducing circuit design margins on the number of good dies, i.e., dies that satisfy the throughput requirement of a real-time streaming application. We show on both synthetic and real applications that the proposed better-than-worst-case design approach can increase the number of good dies by up to 9.6% and 18.8% for designs with and without fixed SRAM and IO blocks, respectively.
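To make the notion of a "good die" concrete, the following is a minimal Monte Carlo sketch, not the framework proposed in this work. It assumes Gaussian relative frequency variation with one shared die-to-die term and one independent within-die term per voltage-frequency island, and it replaces the system-level throughput analysis with a simple pipeline whose throughput is set by its slowest stage. All function names, parameters, and default sigma values are hypothetical and chosen only for illustration.

```python
import random


def sample_die_frequencies(nominal_mhz, sigma_d2d, sigma_wid, rng):
    """Draw island frequencies for one die (assumed Gaussian model).

    One die-to-die offset is shared by all islands; each island also
    gets its own independent within-die offset.
    """
    d2d = rng.gauss(0.0, sigma_d2d)
    return [f * (1.0 + d2d + rng.gauss(0.0, sigma_wid)) for f in nominal_mhz]


def pipeline_throughput(freqs_mhz, cycles_per_frame):
    """Frames per second of a simple pipeline, limited by its slowest stage.

    This bottleneck model is an assumption standing in for a full
    system-level throughput analysis.
    """
    return min(f * 1e6 / c for f, c in zip(freqs_mhz, cycles_per_frame))


def good_die_fraction(nominal_mhz, cycles_per_frame, required_fps,
                      sigma_d2d=0.05, sigma_wid=0.03, n_dies=10000, seed=0):
    """Estimate the fraction of sampled dies that meet the throughput requirement."""
    rng = random.Random(seed)
    good = 0
    for _ in range(n_dies):
        freqs = sample_die_frequencies(nominal_mhz, sigma_d2d, sigma_wid, rng)
        if pipeline_throughput(freqs, cycles_per_frame) >= required_fps:
            good += 1
    return good / n_dies


if __name__ == "__main__":
    # Hypothetical three-island design and a 30 frames-per-second requirement.
    nominal = [400.0, 300.0, 500.0]   # MHz per island
    cycles = [10e6, 8e6, 12e6]        # cycles per frame per stage
    print(good_die_fraction(nominal, cycles, required_fps=30.0))
```

In this sketch, lowering the nominal frequencies models a smaller design margin: the estimated fraction of good dies drops, while in practice the smaller circuits would allow more dies per wafer. That trade-off is what the abstract refers to, and the sketch covers only the first half of it.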
