Automatic Generation of Miniaturized Synthetic Proxies for Target Applications to Efficiently Design Multicore Processors

Prohibitive simulation time with pre-silicon design models and unavailability of proprietary target applications make microprocessor design very tedious. The framework proposed in this paper is the first attempt to automatically generate synthetic benchmark proxies for real world multithreaded applications. The framework includes metrics that characterize the behavior of the workloads in the shared caches, coherence logic, out-of-order cores, interconnection network and DRAM. The framework is evaluated by generating proxies for the workloads in the multithreaded PARSEC benchmark suite and validating their fidelity by comparing the microarchitecture dependent and independent metrics to that of the original workloads. The average error in IPC is 4.87 percent and maximum error is 10.8 percent for Raytrace in comparison to the original workloads. The average error in the power-per-cycle metric is 2.73 percent with a maximum of 5.5 percent when compared to original workloads. The representativeness of the proxies to that of the original workloads in terms of their sensitivity to design changes is evaluated by finding the correlation coefficient between the trends followed by the synthetic and the original for design changes in IPC, which is 0.92. A speedup of four to six orders of magnitude is achieved by using the synthetic proxies over the original workloads.

[1]  James E. Smith,et al.  Modeling superscalar processors via statistical simulation , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[2]  John B. Carter,et al.  An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[3]  Lizy Kurian John,et al.  MAximum Multicore POwer (MAMPO) — An automatic multithreaded synthetic power virus generation framework for multicore systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[4]  Amir Herzberg,et al.  Public protection of software , 1985, TOCS.

[5]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[6]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[7]  James R. Larus,et al.  Compiler-directed Shared-Memory Communication for Iterative Parallel Applications , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[8]  John J. Marciniak,et al.  Encyclopedia of Software Engineering , 1994, Encyclopedia of Software Engineering.

[9]  Wing Shing Wong,et al.  Benchmark Synthesis Using the LRU Cache Hit Function , 1988, IEEE Trans. Computers.

[10]  Lizy Kurian John,et al.  Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory and TLB Characteristics , 2009, SPEC Benchmark Workshop.

[11]  Lizy Kurian John,et al.  A Performance Counter Based Workload Characterization on Blue Gene/P , 2008, 2008 37th International Conference on Parallel Processing.

[12]  Matthew K. Farrens,et al.  Branch transition rate: a new metric for improved branch classification analysis , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[13]  Linda M. Wills,et al.  Reverse Engineering , 1996, Springer US.

[14]  Lizy K. John,et al.  Workload Synthesis for a Communications SoC , 2011 .

[15]  Lieven Eeckhout,et al.  Distilling the essence of proprietary workloads into miniature benchmarks , 2008, TACO.

[16]  Lizy Kurian John,et al.  Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite , 2007, ISCA '07.

[17]  Simon W. Moore,et al.  A communication characterisation of Splash-2 and Parsec , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[18]  Brad Calder,et al.  A co-phase matrix to guide simultaneous multithreading simulation , 2004, IEEE International Symposium on - ISPASS Performance Analysis of Systems and Software, 2004.

[19]  Yan Solihin,et al.  WEST: Cloning data cache behavior using Stochastic Traces , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[20]  Aamer Jaleel,et al.  DRAMsim: a memory system simulator , 2005, CARN.

[21]  Zhanpeng Jin,et al.  ImplantBench: Characterizing and Projecting Representative Benchmarks for Emerging Bioimplantable Computing , 2008, IEEE Micro.

[22]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[23]  Greg Hamerly,et al.  SimPoint 3.0: Faster and More Flexible Program Analysis , 2005 .

[24]  Lieven Eeckhout,et al.  Control flow modeling in statistical simulation for accurate and efficient processor design studies , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[25]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[26]  Lieven Eeckhout,et al.  Automated microprocessor stressmark generation , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[27]  Lizy Kurian John,et al.  Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[28]  Lizy Kurian John,et al.  Automatic testcase synthesis and performance model validation for high performance PowerPC processors , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[29]  Anand Sivasubramaniam,et al.  Architectural Mechanisms for Explicit Communication in Shared Memory Multiprocessors , 1995, SC.

[30]  Lieven Eeckhout,et al.  Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks , 2006, 2006 IEEE International Symposium on Workload Characterization.

[31]  Michael C. Huang,et al.  Improving support for locality and fine-grain sharing in chip multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  Brad Calder,et al.  SimPoint 3.0: Faster and More Flexible Program Phase Analysis , 2005, J. Instr. Level Parallelism.

[33]  Lizy Kurian John,et al.  System-level Max POwer (SYMPO) - a systematic approach for escalating system-level power consumption using synthetic benchmarks , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[34]  Karthik Ganesan Automatic generation of synthetic workloads for multicore systems , 2011 .

[35]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[36]  Frederic T. Chong,et al.  HLS: combining statistical and symbolic simulation to guide microprocessor designs , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[37]  Brad Calder,et al.  How to use SimPoint to pick simulation points , 2004, PERV.

[38]  Lizy Kurian John,et al.  Improved automatic testcase synthesis for performance model validation , 2005, ICS '05.