Filter Caching for Free: The Untapped Potential of the Store-Buffer

Modern processors contain store-buffers to allow stores to retire under a miss, thereby hiding store-miss latency. The store-buffer must be large (for performance) and searched on every load (for correctness), making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no regard for the store-buffer's intrinsic hit rate or for whether a store-buffer hit can be predicted, which would save energy by disabling the L1 and TLB probes. In this work we cache data that have already been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries among the store-queue/buffer/cache, we achieve nearly optimal reuse without causing stalls. We can do this efficiently and cheaply by recognizing three key properties of stores: free caching (since stores must be written into the store-buffer for correctness, no additional data movement is needed), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free, accurate hit prediction (since the memory dependence predictor already performs it for scheduling). As a result, we increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvement of 1.5%, up to 4.7%). The cost of these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
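The mechanism described above can be sketched in a few lines of behavioral simulation code. The model below is a simplification for illustration only, not the paper's actual microarchitecture: a FIFO store-buffer whose retired entries are retained as a small cache until evicted by capacity, paired with a hypothetical 1-bit, PC-indexed last-outcome predictor standing in for the memory dependence predictor that decides whether the L1/TLB probe can be gated off.

```python
from collections import OrderedDict

class StoreBufferCache:
    """Illustrative model of a unified store-queue/buffer/cache.

    Assumptions (not from the paper): FIFO capacity eviction, and a 1-bit
    last-outcome hit predictor indexed by load PC.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()  # addr -> value; insertion order = age
        self.predictor = {}           # load PC -> did the last load at this PC hit?
        self.l1_probes_saved = 0      # loads whose L1/TLB probe was gated off

    def store(self, addr, value):
        # Stores must be written into the buffer for correctness anyway,
        # so keeping the data around is "free caching": no extra movement.
        self.entries.pop(addr, None)
        self.entries[addr] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

    def load(self, pc, addr):
        predicted_hit = self.predictor.get(pc, False)
        hit = addr in self.entries
        if predicted_hit and hit:
            # Correctly predicted hit: the parallel L1/TLB probe is skipped.
            self.l1_probes_saved += 1
        self.predictor[pc] = hit      # 1-bit last-outcome update
        return self.entries.get(addr) # None models falling through to the L1
```

A short walk-through: the first load at a given PC trains the predictor, so only repeat loads that actually hit in the buffer earn the energy saving; a mispredicted hit simply falls back to the normal parallel probe and costs nothing extra in this model.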
