LMStr: exploring shared hardware controlled scratchpad memory for multicores

In this paper, we present an on-chip memory store called the "Local Memory Store (LMStr)", which can be used alongside a regular cache hierarchy or on its own as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM shared among the cores of a multicore processor. The store itself is managed by hardware, yet compiler support is instrumental in deciding which data items and types should live in it. Critical data is placed in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can also, at will, give the compiler hints to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr improves data movement by up to 21% compared to cache alone, with a mere 3% area overhead. LMStr also improves the cycles per memory access by up to 40%, and we project up to 85% less dynamic energy consumption compared to a traditional cache.
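As an illustration of the programmer hints mentioned above, the following is a minimal sketch of how a matrix multiplication kernel might mark data for placement in the LMStr. The `LMSTR_HINT` macro is hypothetical: the abstract does not specify the hint syntax, so here it expands to nothing and the code compiles with any C compiler. The choice of which arrays to annotate, and the helper names `fill_inputs`, `matmul`, and `result_at`, are likewise assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placement hint. This is NOT a real compiler attribute and
 * not the paper's syntax; it expands to nothing so the code stays portable. */
#define LMSTR_HINT

enum { N = 4 };

/* Assumption: the heavily reused input matrices are the "critical" data
 * a programmer would ask the compiler to keep in the LMStr. */
static LMSTR_HINT double a[N][N];
static LMSTR_HINT double b[N][N];

/* Assumption: the output matrix is left to the regular cache hierarchy. */
static double c[N][N];

/* Fill a with a[i][j] = i + j and b with the identity matrix. */
void fill_inputs(void)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (i == j) ? 1.0 : 0.0;
        }
    }
}

/* Plain triple-loop matrix multiplication: c = a * b. */
void matmul(void)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;  /* temporary accumulator */
            for (size_t k = 0; k < N; k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}

/* Read one element of the result, with bounds checked by assertion. */
double result_at(size_t i, size_t j)
{
    assert(i < N && j < N);
    return c[i][j];
}
```

Because `b` is the identity matrix in this sketch, the product equals `a`, which makes the result easy to check by hand. In a real toolchain the hint would have to be recognized by the compiler; without hints, the compiler would decide placement from the data type alone, as the abstract describes.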
