The Superfluous Load Queue

In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are responsible for ensuring: i) correct forwarding of stores to loads and ii) correct ordering among loads (with respect to external stores). The first requirement safeguards the sequential semantics of program execution and applies to both serial and parallel code; the second safeguards the semantics of coherence and consistency (e.g., TSO). In particular, loads search the SQ/SB for the latest value that may have been produced by a store, while stores and invalidations search the LQ for speculative loads that may have violated uniprocessor or multiprocessor ordering. To meet timing constraints, the LQ and the SQ/SB are built from CAM structures that are searched frequently. This results in high complexity, high cost, and significant difficulty in scaling, yet it is the current state of the art. Prior research demonstrated the feasibility of a non-associative LQ by replaying loads at commit. That approach has a steep cost, however: a significant increase in L1 accesses and in contention for L1 ports. This is because prior work assumes Sequential Consistency and completely ignores the existence of an SB in the system. In contrast, we intentionally delay stores in the SB to gain full control over the stores and loads in a core, while still supporting TSO. Our main result is that we eliminate the LQ without burdening the L1 with extra accesses. Store forwarding is achieved by delaying our own stores until speculatively issued loads are validated at commit, entirely in-core; TSO load→load ordering is preserved by delaying external (remote) stores in their SB until our own speculatively reordered loads commit. While the latter is inspired by recent work on non-speculative load reordering, our contribution here is to show that it can be accomplished without a load queue. Eliminating the LQ yields both energy savings and, by removing LQ-induced stalls, a performance improvement.
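The in-core half of this idea (validating speculatively issued loads at commit against stores that are deliberately held back) can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions, not the design evaluated in this work: the names `ToyCore`, `Store`, `Load`, `issue_load`, `commit_load`, and `drain` are hypothetical, the drain rule (a store waits while any younger load is issued but uncommitted) is a deliberately conservative simplification, and multiprocessor ordering (external stores, invalidations) is not modeled at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    seq: int    # program-order sequence number (hypothetical bookkeeping)
    addr: int
    value: int


@dataclass
class Load:
    seq: int
    addr: int
    src_seq: Optional[int] = None  # store forwarded from at issue; None = memory
    value: Optional[int] = None


class ToyCore:
    """Toy single-core model with an SQ/SB and no load queue."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.sb = []           # address-resolved stores not yet written to memory
        self.inflight = set()  # seq numbers of issued but uncommitted loads

    def resolve_store(self, store):
        # The store's address becomes known; keep the SQ/SB in program order.
        self.sb.append(store)
        self.sb.sort(key=lambda s: s.seq)

    def _match(self, load):
        # Youngest store older than the load with the same address, if any.
        older = [s for s in self.sb if s.seq < load.seq and s.addr == load.addr]
        return older[-1] if older else None

    def issue_load(self, load):
        # Speculative issue: forward from the SQ/SB if possible, else read memory.
        s = self._match(load)
        load.src_seq = s.seq if s else None
        load.value = s.value if s else self.memory.get(load.addr, 0)
        self.inflight.add(load.seq)

    def commit_load(self, load):
        # Validation at commit: search the SQ/SB again, with no memory access.
        # A different forwarding source means the load must be squashed.
        s = self._match(load)
        ok = (s.seq if s else None) == load.src_seq
        self.inflight.discard(load.seq)
        return ok

    def drain(self):
        # Assumed conservative rule: the oldest store stays in the SB while any
        # younger load is still in flight, so commit-time checks can see it.
        drained = 0
        while self.sb and not any(l > self.sb[0].seq for l in self.inflight):
            s = self.sb.pop(0)
            self.memory[s.addr] = s.value
            drained += 1
        return drained
```

In this model, a load that issues before an older store to the same address has resolved reads a stale value from memory. Because that store cannot drain while the load is in flight, the load's commit-time search of the SQ/SB finds it, detects the mismatch, and forces a squash and re-issue, without an associative search of a load queue and without a second memory access.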
