Paper Information - Filter Caching for Free: The Untapped Potential of the Store-Buffer

Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), making it a costly structure in both area and energy. Yet on every load the store-buffer is probed in parallel with the L1 and TLB, with no regard for the store-buffer's intrinsic hit rate or for whether a store-buffer hit could be predicted, which would save energy by disabling the L1 and TLB probes. In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries among the store-queue, store-buffer, and cache, we can achieve nearly optimal reuse without causing stalls. We are able to do this efficiently and cheaply by recognizing three key properties of stores: free caching (since stores must be written into the store-buffer for correctness, no additional data movement is needed), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling). As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvement of 1.5%, up to 4.7%). The cost of these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
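The idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's design: the class `UnifiedStoreBuffer`, the function `run`, the FIFO eviction policy, and the per-PC last-outcome table standing in for the memory dependence predictor are all hypothetical, and coherence and timing are ignored. It only counts how many L1/TLB probes are skipped when written-back stores stay resident and a predicted store-buffer hit disables the parallel probe.

```python
from collections import OrderedDict


class UnifiedStoreBuffer:
    """Toy unified store-queue/buffer/cache (hypothetical model).

    Entries are kept after write-back and act as cached copies until
    newer stores push them out in FIFO order.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> value, oldest first

    def store(self, addr, value):
        self.entries.pop(addr, None)  # a newer store supersedes the older one
        self.entries[addr] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

    def lookup(self, addr):
        return self.entries.get(addr)


def run(trace, capacity, predict_hits):
    """Count structure probes for a trace of (op, pc, addr) tuples."""
    sb = UnifiedStoreBuffer(capacity)
    last_outcome = {}  # assumed predictor: load pc -> did its last access hit?
    probes = {"sb": 0, "l1_tlb": 0}
    for op, pc, addr in trace:
        if op == "st":
            sb.store(addr, pc)  # the stored value is irrelevant here
            continue
        probes["sb"] += 1  # every load searches the store-buffer
        hit = sb.lookup(addr) is not None
        predicted_hit = predict_hits and last_outcome.get(pc, False)
        # L1/TLB are probed in parallel when a miss is predicted, and
        # probed on replay when a predicted hit turns out to be a miss.
        if not (predicted_hit and hit):
            probes["l1_tlb"] += 1
        last_outcome[pc] = hit
    return probes


trace = [
    ("st", 0, 0x100),
    ("ld", 1, 0x100),
    ("ld", 1, 0x100),
    ("ld", 1, 0x100),
    ("ld", 2, 0x200),
]
print(run(trace, capacity=4, predict_hits=False))  # {'sb': 4, 'l1_tlb': 4}
print(run(trace, capacity=4, predict_hits=True))   # {'sb': 4, 'l1_tlb': 2}
```

In this trace the first load from each PC still probes the L1 and TLB because the assumed predictor has no history yet; the two repeated loads that hit in the store-buffer skip those probes.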

David Black-Schaffer | Stefanos Kaxiras | Alberto Ros | Ricardo Alves
