Execution-driven evaluation of cache-coherent shared-memory multiprocessors

Shared-memory multiprocessors are gaining wide acceptance because they are easy to program. Caches and support for cache coherence are an integral part of shared-memory systems, and large-scale systems use directory-based coherence protocols. This dissertation deals with the evaluation of large-scale cache-coherent shared-memory systems. Execution-driven simulation has been used for the evaluation because of the complexity of the system model and to facilitate the use of real applications as the workload. The interconnection network is a crucial component of multiprocessor systems. Here, the performance of a two-dimensional torus and of a multi-stage network has been evaluated for cache-coherent shared-memory systems. Both packet switching and wormhole routing have been considered, along with virtual channels for wormhole-routed networks. The networks are evaluated for varying numbers of virtual channels, amounts of buffer space, and numbers of internal links. The network in cache-coherent systems suffers from hot-spots due to bursty traffic. The effectiveness of adaptive routing techniques in dealing with hot-spots in wormhole networks has been investigated. A partially-adaptive routing scheme and two fully-adaptive routing schemes have been considered for two-dimensional tori, and their performance has been compared against non-adaptive e-cube routing. The distribution of shared memory blocks in the system determines the traffic pattern and affects network performance. Two different memory organizations, high-order interleaving and low-order interleaving, have been considered. The traffic pattern and execution time of several applications have been determined for these memory organizations. For most of the applications, low-order interleaving results in lower execution time due to the removal of hot-spots in the network and at the memory modules.
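The difference between the two memory organizations can be illustrated with a minimal sketch. It is not the simulator used in this work; the function names, the block-granularity addressing, and the example sizes (4 modules, 4 blocks per module) are assumptions chosen only for illustration. It shows how consecutive shared-memory blocks map to memory modules under each scheme.

```python
from collections import Counter


def high_order_module(block, num_modules, blocks_per_module):
    """High-order interleaving (illustrative): the upper part of the block
    address selects the module, so consecutive blocks share a module."""
    module = block // blocks_per_module
    if module >= num_modules:
        raise ValueError("block address outside the modelled memory")
    return module


def low_order_module(block, num_modules):
    """Low-order interleaving (illustrative): the lower part of the block
    address selects the module, so consecutive blocks rotate over modules."""
    return block % num_modules


def module_load(blocks, mapping):
    """Count how many of the accessed blocks land on each memory module."""
    return Counter(mapping(b) for b in blocks)


if __name__ == "__main__":
    NUM_MODULES = 4          # assumed example size
    BLOCKS_PER_MODULE = 4    # assumed example size
    contiguous = range(4)    # e.g. a small array stored in adjacent blocks

    print(module_load(
        contiguous,
        lambda b: high_order_module(b, NUM_MODULES, BLOCKS_PER_MODULE)))
    print(module_load(
        contiguous,
        lambda b: low_order_module(b, NUM_MODULES)))
```

In this toy case, accesses to four adjacent blocks all target one module under high-order interleaving, while low-order interleaving spreads them over four modules. This is one way contiguous data can concentrate traffic at a single module and on the network paths leading to it.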
Another important aspect of this work has been the optimization of algorithms for cache-coherent shared-memory systems. A fast Fourier transform (FFT) algorithm optimized for cache-based systems has been developed and compared with the conventional FFT algorithm in terms of the number of cache misses and execution time. The technique used in the new algorithm can also be applied to optimize other applications for cache-based systems.