On reducing load/store latencies of cache accesses

Effective address calculations for load and store instructions must compete with other instructions for the ALU, which can add latency to data cache accesses. Fast address generation is one approach to reducing cache access latency. This paper presents a fast address generator that eliminates most effective address computations by storing the computed effective addresses of previous load/store instructions in a dummy register file. Experimental results show that, on the SPECint2000 benchmarks, this fast address generator reduces the effective address computations of load and store instructions by about 74% on average and cuts execution time by 8.5%. Furthermore, when multiple dummy register files are deployed, it eliminates over 90% of the effective address computations of load and store instructions and improves average execution time by 9.3%.
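The reuse idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the abstract does not say how the dummy register file is indexed or invalidated, so the one-entry-per-base-register layout, the offset match, the invalidate-on-write rule, and every name here (DummyRegisterFile, effective_address, invalidate) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DummyRegisterFile:
    """Hypothetical model: one saved address per architectural base register."""

    # base register name -> (offset, effective address) of the last access
    entries: dict = field(default_factory=dict)
    reused: int = 0    # accesses served from a saved address
    computed: int = 0  # accesses that needed the ALU

    def invalidate(self, reg):
        """Assumed rule: drop the saved address when the base register is written."""
        self.entries.pop(reg, None)

    def effective_address(self, base_reg, offset, regs):
        """Return base + offset, reusing a saved result when the offset matches."""
        saved = self.entries.get(base_reg)
        if saved is not None and saved[0] == offset:
            self.reused += 1
            return saved[1]
        self.computed += 1
        address = regs[base_reg] + offset  # stands in for the ALU add
        self.entries[base_reg] = (offset, address)
        return address


# Toy trace of (base register, offset) pairs; register values are made up.
regs = {"sp": 0x1000, "t0": 0x2000}
drf = DummyRegisterFile()
trace = [("sp", 8), ("sp", 8), ("t0", 0), ("sp", 8), ("t0", 0)]
addresses = [drf.effective_address(base, off, regs) for base, off in trace]

# Writing the base register makes the saved address stale, so it is dropped.
regs["sp"] = 0x0FF0
drf.invalidate("sp")
after_write = drf.effective_address("sp", 8, regs)

print(f"reused={drf.reused} computed={drf.computed}")
```

In this toy trace, three of the six accesses reuse a saved address. The fraction depends entirely on the made-up trace and says nothing about the 74% and 90% figures reported for SPECint2000.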
