Multi-configuration simulation algorithms for the evaluation of computer architecture designs

In computer architecture design, a number of candidate designs are simulated on representative workloads, and the most satisfactory design in terms of cost and performance is chosen. This simulation process is time-consuming, especially memory hierarchy simulation, and is a bottleneck in architectural design. In this thesis the multi-configuration simulation approach is adopted for speeding up the simulation process. This approach is based on the observation that the behavior of adjacent design configurations is largely similar, and that the similarity may be exploited to reduce simulation work; significant reductions in simulation time are obtained by a synergistic simulation of many design configurations. A suite of multi-configuration simulation algorithms is developed for memory hierarchy simulation. The suite includes (1) An algorithm for set-associative cache simulation based on a new data structure (the generalized binomial tree) which runs about two times faster than earlier algorithms. (2) An algorithm for direct mapped cache simulation based on a novel tag inclusion property which also gives a factor of two improvement over an earlier algorithm. (3) An innovative limited lookahead algorithm with stack repair for simulating the OPT replacement strategy in caches. (4) Novel multi-configuration simulation algorithms for write-buffers. A simulation package, Cheetah, based on these algorithms has been developed and used in the following modeling and optimization studies. First, a new model, the OPT model, is introduced for classifying cache misses. Unlike earlier models, the OPT model accounts for misses resulting from sub-optimal replacement policies used in practical caches. Experimental characterizations based on the OPT model of the cache misses occurring in the SPEC benchmarks are then presented. The results demonstrate that the replacement policy contributes to a significant fraction of cache misses. Second, the hit-miss and reuse behavior of individual load/store instructions of the SPEC benchmarks are profiled. The profiles show that a small number of instructions contribute to a large percentage of the misses. By scheduling the instructions that miss to hide latency, a factor of three improvement is demonstrated for loop-dominated code. By partially controlling cache replacement using the profile information on data reuse up to a 20% reduction in miss ratio is demonstrated.