A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors

A store queue (SQ) is a critical component of the load execution machinery. High-ILP processors require high load execution bandwidth, but providing high-bandwidth SQ access is difficult. Address banking, which works well for caches, conflicts with the age ordering that the SQ requires, and multi-porting exacerbates the latency of the associative searches that load execution requires. In this paper, we present a new high-bandwidth load-store unit design that exploits the predictability of forwarding behavior. First, a simple predictor keeps loads that are unlikely to require forwarding from accessing the SQ, enabling a reduction in the number of associative ports. A subset of the loads that do not access the SQ is re-executed prior to retirement to detect over-aggressive filtering and to train the predictor. A novel adaptation of a Bloom filter keeps the re-execution subset minimal. Next, the same predictor keeps stores that don't forward values to nearby loads out of the SQ, enabling a substantial capacity reduction. To enable this optimization and maintain in-order store retirement, we add a second SQ that contains all stores but is used only for retirement and Bloom filter management; this queue is large but isn't associatively searched. Finally, to boost both load and store filtering and to handle programs with heavy forwarding bandwidth requirements, we add a second, address-banked forwarding structure that handles "easy" forwarding instances, leaving the globally-ordered SQ to handle only "tricky" cases. Our design does not directly address load queue scalability, but it does dovetail with a recent proposal that also uses re-execution to tackle this issue. Performance simulations on SPEC2000 and MediaBench benchmarks show that our design comes within 2% (7% in the worst case) of the performance of an ideal multi-ported SQ, using only a 16-entry queue with a single associative lookup port.
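
The load-filtering step described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the hardware design itself: the abstract does not specify the predictor organization or the Bloom filter adaptation, so the PC-indexed 2-bit counters, the table sizes, the hash functions, and the use of a plain counting Bloom filter over in-flight store addresses are all hypothetical stand-ins, as are the names `ForwardingPredictor`, `CountingBloomFilter`, and `load_plan`.

```python
class ForwardingPredictor:
    """PC-indexed table of 2-bit saturating counters (assumed organization)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predicts_forwarding(self, pc):
        # A load searches the SQ only when its counter is in the upper half.
        return self.counters[self._index(pc)] >= 2

    def train(self, pc, forwarded):
        i = self._index(pc)
        if forwarded:
            self.counters[i] = 3  # saturate at once after a missed forwarding
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class CountingBloomFilter:
    """Tracks addresses of in-flight stores; counters allow removal at retirement.

    Stand-in for the paper's Bloom filter adaptation, which the abstract
    does not describe.
    """

    def __init__(self, size=256):
        self.size = size
        self.counts = [0] * size

    def _hashes(self, addr):
        word = addr >> 3  # assume 8-byte granularity
        return (word % self.size, ((word * 2654435761) >> 8) % self.size)

    def insert(self, addr):
        for h in self._hashes(addr):
            self.counts[h] += 1

    def remove(self, addr):
        for h in self._hashes(addr):
            self.counts[h] -= 1

    def may_contain(self, addr):
        # False positives are possible; false negatives are not.
        return all(self.counts[h] > 0 for h in self._hashes(addr))


def load_plan(predictor, in_flight_stores, pc, addr):
    """Return (search_sq, reexecute) for one load."""
    search_sq = predictor.predicts_forwarding(pc)
    # Only loads that skipped the SQ and might alias an in-flight store
    # need to be re-executed before retirement.
    reexecute = (not search_sq) and in_flight_stores.may_contain(addr)
    return search_sq, reexecute
```

In this model, a load that skips the SQ and then gets a different value on re-execution would call `train(pc, True)`, so later instances of the same load search the SQ. Loads whose addresses miss in the filter retire without re-execution, which is the sense in which the filter keeps the re-execution subset small.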
