SAMO: Store-Aware Memory Optimizations

Cache optimizations and DRAM scheduling play an important role in determining system performance, given the ever-increasing demand for memory. In this paper we track stores at both the cache and main memory and apply three optimizations: (1) at the cache level, stores are serviced faster, which reduces load-store queue block cycles; (2) in the miss handling architecture, entries containing only store requests are removed, which reduces cache stall cycles; and (3) at main memory, stores are serviced at lower priority so that reads are serviced faster. Combined, these three optimizations form the store-aware memory optimization (SAMO) framework, which improves system performance on average and can be combined with previously proposed memory optimization techniques. SAMO speeds up workloads on 4- and 8-core systems by a geometric mean of 5.0% and 7.4%, respectively, with maximum speed-ups of 21.9% and 17.8%, respectively.
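The third optimization, servicing stores at lower priority than reads at main memory, can be illustrated with a small queue model. This is only a sketch under stated assumptions, not the paper's scheduler: the abstract gives no policy details, so the strict read-first ordering, the FIFO tie-break within each class, and the names `Request`, `StoreAwareQueue`, and `service_order` are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Request:
    # Only priority and seq take part in ordering comparisons.
    priority: int                          # 0 = load (read), 1 = store (write)
    seq: int                               # arrival order, keeps FIFO within a class
    addr: int = field(compare=False)
    is_store: bool = field(compare=False)


class StoreAwareQueue:
    """Hypothetical memory request queue that services reads before stores."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def enqueue(self, addr, is_store):
        # Assumption: stores always get the lower priority (larger number).
        priority = 1 if is_store else 0
        heapq.heappush(self._heap, Request(priority, next(self._seq), addr, is_store))

    def next_request(self):
        # Returns the oldest pending read, or the oldest store if no read is pending.
        return heapq.heappop(self._heap) if self._heap else None


def service_order(requests):
    """Return (addr, is_store) pairs in the order this sketch would service them."""
    queue = StoreAwareQueue()
    for addr, is_store in requests:
        queue.enqueue(addr, is_store)
    order = []
    while (req := queue.next_request()) is not None:
        order.append((req.addr, req.is_store))
    return order


if __name__ == "__main__":
    arrivals = [(0x10, True), (0x20, False), (0x30, True), (0x40, False)]
    for addr, is_store in service_order(arrivals):
        print(hex(addr), "store" if is_store else "load")
```

With the arrivals above, both loads are serviced before either store, and each class keeps its arrival order. A strict policy like this could starve stores under a steady stream of reads, so a practical controller would presumably also drain stores once some occupancy or age limit is reached; that limit is not specified in the abstract and is left out of the sketch.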
