Hardware-based fast exploration of cache hierarchies in application specific MPSoCs

Multi-level caches are widely used to improve the memory access speed of multiprocessor systems. Choosing a suitable set of cache memories for the memory hierarchy of an application-specific embedded system is a tedious problem, particularly for MPSoCs. To accurately determine the number of hits and misses for every configuration in the design space of an MPSoC, researchers typically first extract a memory trace using an instruction set simulator and then replay it in a software cache simulator. Such simulations can take anywhere from several hours to months. We propose a novel method based on specialized hardware that quickly simulates the design space of cache configurations for a shared-memory multiprocessor system on an FPGA, analyzing the memory traces and calculating the cache hits and misses simultaneously. We demonstrate that our simulator can explore the cache design space of a quad-core system with private L1 caches and a shared L2 cache, over a range of standard benchmarks, taking as little as 0.106 seconds per million memory accesses, which is up to 456 times faster than the fastest known software-based simulator. Because we emulate the program and analyze its memory traces at the same time, there is no need to extract multiple memory access traces before simulation, which saves a significant amount of time during the design stage.
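To make the cost of the software baseline concrete, the following is a minimal sketch of trace-driven design space exploration: one memory trace is replayed against every candidate cache configuration, and hits and misses are counted for each. It is not the hardware method proposed here. It assumes a single-level, set-associative cache with LRU replacement, and the names `simulate_lru` and `explore` are hypothetical, introduced only for illustration.

```python
from collections import OrderedDict
from itertools import product


def simulate_lru(trace, num_sets, assoc, line_size):
    """Count (hits, misses) for one set-associative LRU cache over a trace.

    Assumption: `trace` is an iterable of byte addresses (integers).
    """
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // line_size   # cache-line number
        index = block % num_sets    # which set the line maps to
        tag = block // num_sets     # identifies the line within its set
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)  # mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return hits, misses


def explore(trace, set_counts, assocs, line_sizes):
    """Replay the same trace once per configuration in the design space."""
    trace = list(trace)
    return {
        cfg: simulate_lru(trace, *cfg)
        for cfg in product(set_counts, assocs, line_sizes)
    }


if __name__ == "__main__":
    demo_trace = [0, 4, 64, 0, 128, 64]
    for cfg, (h, m) in explore(demo_trace, [1, 2], [1, 2], [64]).items():
        print(cfg, "hits:", h, "misses:", m)
```

In this sketch the work grows with the number of configurations multiplied by the trace length, since each configuration is simulated in turn. That sequential replay is the kind of cost a hardware simulator can attack by evaluating configurations in parallel.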
