Performance Optimization of Pipelined Primary Caches

The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to reduce the dependence of CPU cycle time on cache access time. This paper studies the performance advantages of a pipelined cache for a GaAs implementation of the MIPS-based architecture, using a design methodology that includes long traces of multiprogrammed applications and detailed timing analysis. The study evaluates instruction and data caches with various pipeline depths, cache sizes, block sizes, and refill penalties, and it factors the impact of each alternative on CPU cycle time into the evaluation. Both hardware-based and software-based strategies are considered for hiding the branch and load delays that may be required to avoid pipeline hazards. The results show that software-based methods for mitigating the penalty of branch delays can be as successful as the hardware-based branch-target-buffer approach, despite the code expansion inherent in the software methods. The situation is similar for load delays: while hardware-based dynamic methods hide more delay cycles than static approaches do, they may give up that advantage by extending the cycle time. Because these methods are quite successful at hiding small numbers of branch and load delays, and because processors with pipelined caches also have shorter CPU cycle times and larger caches, a significant performance advantage is gained by using two to three pipeline stages to fetch data from the cache.
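The core trade-off described above (a deeper cache pipeline shortens the cycle, but exposes more branch and load delay cycles, and a fixed refill time in nanoseconds costs more cycles at a faster clock) can be sketched with a first-order model. This is a hypothetical illustration only, not the paper's trace-driven methodology. All parameter values, the names `MachineParams`, `cycle_time_ns`, `unhidden_delay_cycles` and `time_per_instruction_ns`, the assumption that each added cache stage adds one branch delay and one load delay, and the rule that the k-th delay slot is filled with probability `p**k` are assumptions chosen for the sketch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineParams:
    """Hypothetical, illustrative parameters; NOT values from the paper."""
    t_cache_ns: float = 6.0   # unpipelined primary-cache access time
    t_latch_ns: float = 0.5   # overhead added by each pipeline latch
    t_logic_ns: float = 2.5   # longest non-cache path (ALU, register file, ...)
    f_branch: float = 0.15    # branches per instruction
    f_load: float = 0.25      # loads per instruction
    p_branch: float = 0.7     # base probability that a branch delay slot is filled
    p_load: float = 0.6       # base probability that a load delay slot is filled
    miss_rate: float = 0.02   # cache misses per instruction (I + D combined)
    refill_ns: float = 40.0   # miss refill time, fixed in nanoseconds


def cycle_time_ns(depth: int, m: MachineParams) -> float:
    # Splitting the cache access over `depth` stages shortens the cache path,
    # but the cycle can never be shorter than the slowest non-cache path.
    return max(m.t_cache_ns / depth + m.t_latch_ns, m.t_logic_ns)


def unhidden_delay_cycles(n_slots: int, p: float) -> float:
    # Assumption: the k-th delay slot is filled with probability p**k, so the
    # first slot is easy to hide and each additional slot is harder.
    return sum(1.0 - p ** k for k in range(1, n_slots + 1))


def time_per_instruction_ns(depth: int, m: MachineParams) -> float:
    t = cycle_time_ns(depth, m)
    # Assumption: each cache stage beyond the first adds one branch delay
    # cycle and one load delay cycle.
    extra = depth - 1
    cpi = (1.0
           + m.f_branch * unhidden_delay_cycles(extra, m.p_branch)
           + m.f_load * unhidden_delay_cycles(extra, m.p_load)
           # A fixed refill time costs more cycles when the clock is faster.
           + m.miss_rate * math.ceil(m.refill_ns / t))
    return cpi * t


if __name__ == "__main__":
    params = MachineParams()
    for d in range(1, 5):
        print(f"depth={d}  cycle={cycle_time_ns(d, params):.2f} ns  "
              f"time/instr={time_per_instruction_ns(d, params):.3f} ns")
```

Raising `p_branch` or `p_load` stands in for better delay hiding (for example, a branch-target buffer or dynamic load scheduling). If that hardware also lengthens `t_logic_ns`, the model shows how the gain can be lost to a longer cycle, which is the effect the abstract describes for dynamic load-delay methods. The depth that minimizes time per instruction here is a property of the made-up parameters, not a reproduction of the paper's results.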