Fire-and-Forget: Load/Store Scheduling with No Store Queue at All

Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the store queue index prediction (SQIP) approach allows for load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our "fire-and-forget" (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly out-perform the competition. Specifically, our simulation results show that our SQ-less fire-and-forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ

[1]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[2]  Sam S. Stone,et al.  Address-indexed memory disambiguation and store-to-load forwarding , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[3]  Simha Sethumadhavan,et al.  Scalable hardware memory disambiguation for high-ILP processors , 2003, IEEE Micro.

[4]  Craig B. Zilles,et al.  Decomposing the load-store queue by function for power reduction and scalability , 2006, IBM J. Res. Dev..

[5]  Milo M. K. Martin,et al.  Scalable store-load forwarding via store queue index prediction , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[6]  Víctor Viñals,et al.  Store buffer design in first-level multibanked data caches , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[7]  Miodrag Potkonjak,et al.  MediaBench: a tool for evaluating and synthesizing multimedia and communications systems , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[8]  Andreas Moshovos,et al.  Streamlining inter-operation memory communication via data dependence prediction , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  Mikko H. Lipasti,et al.  Silent stores for free , 2000, MICRO 33.

[10]  Gurindar S. Sohi,et al.  ARB: A Hardware Mechanism for Dynamic Reordering of Memory References , 1996, IEEE Trans. Computers.

[11]  R. D. Valentine,et al.  The Intel Pentium M processor: Microarchitecture and performance , 2003 .

[12]  Jun Yang,et al.  Frequent value locality and value-centric data cache design , 2000, SIGP.

[13]  Milo M. K. Martin,et al.  NoSQ: Store-Load Communication without a Store Queue , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[14]  Mikko H. Lipasti,et al.  Modern Processor Design: Fundamentals of Superscalar Processors , 2002 .

[15]  Brad Calder,et al.  Picking statistically valid and early simulation points , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[16]  Amir Roth,et al.  Store vulnerability window (SVW): re-execution filtering for enhanced load optimization , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[17]  Todd M. Austin,et al.  MASE: a novel infrastructure for detailed microarchitectural modeling , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[18]  Gabriel H. Loh,et al.  Memory Bypassing: Not Worth the Effort , 2002 .

[19]  Aamer Jaleel,et al.  Using virtual load/store queues (VLSQs) to reduce the negative effects of reordered memory instructions , 2005, 11th International Symposium on High-Performance Computer Architecture.

[20]  Joel S. Emer,et al.  Memory dependence prediction using store sets , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[21]  Gabriel H. Loh,et al.  Store vectors for scalable memory dependence prediction and scheduling , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[22]  YangJun,et al.  Frequent value locality and value-centric data cache design , 2000 .

[23]  Mikko H. Lipasti,et al.  Memory ordering: a value-based approach , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[24]  Andreas Moshovos,et al.  Speculative Memory Cloaking and Bypassing , 1999, International Journal of Parallel Programming.

[25]  Mikko H. Lipasti,et al.  Memory Ordering: A Value-Based Approach , 2004, ISCA 2004.

[26]  Todd M. Austin,et al.  Efficient detection of all pointer and array access errors , 1994, PLDI '94.

[27]  Trevor Mudge,et al.  MiBench: A free, commercially representative embedded benchmark suite , 2001 .