Scalable Load and Store Processing in Latency-Tolerant Processors

Memory latency tolerant architectures achieve high performance by supporting thousands of in-flight instructions without scaling cycle-critical processor resources. We present new load and store processing algorithms for such latency tolerant architectures. We augment the primary load and store queues with secondary buffers. The secondary load buffer is a set-associative structure, similar to a cache. The secondary store queue, the store redo log (SRL), is a first-in first-out (FIFO) structure that records the program order of all stores completed in parallel with a miss; it has no CAM or search functions. Instead of the secondary store queue, a cache provides temporary store-to-load forwarding. The SRL enforces memory ordering by ensuring that memory updates occur in program order once the miss data arrives from memory. The new algorithms remove fundamental sources of power and area inefficiency in load and store processing by eliminating the CAM and search functions in the secondary load and store buffers, while still achieving competitive performance compared to hierarchical designs.
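
To make the SRL's ordering role concrete, here is a minimal behavioral sketch, assuming a simple FIFO of (address, data) pairs and a flat memory image modeled as a hash map. The names StoreRedoLog, StoreEntry, append, and drain are illustrative rather than taken from the paper, and store-to-load forwarding through the cache is not modeled.

// Sketch only (not the paper's hardware): a store redo log (SRL) modeled as a
// plain FIFO. Stores that complete while a long-latency miss is outstanding
// are appended in program order; when the miss data returns, the log is
// drained front-to-back so memory is updated in program order.
// No CAM or associative search is required.
#include <cstdint>
#include <queue>
#include <unordered_map>

struct StoreEntry {
    uint64_t addr;   // effective address of the store
    uint64_t data;   // value to write
};

class StoreRedoLog {
public:
    // Record a store that completed in the shadow of a miss.
    void append(uint64_t addr, uint64_t data) {
        log_.push({addr, data});
    }

    // Once the miss data has arrived, replay all logged stores in program
    // order so the memory image is updated exactly as the program intended.
    void drain(std::unordered_map<uint64_t, uint64_t>& memory) {
        while (!log_.empty()) {
            const StoreEntry& s = log_.front();
            memory[s.addr] = s.data;   // in-order memory update
            log_.pop();
        }
    }

    bool empty() const { return log_.empty(); }

private:
    std::queue<StoreEntry> log_;   // FIFO: no search ports, no CAM
};

Because the log is drained strictly front-to-back, no associative lookup is needed to locate a matching store: program order alone guarantees that the last write to an address is the one left in memory.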
