Performance measurement and hardware support for message passing in distributed memory multicomputers

In distributed memory multicomputers, synchronization and data sharing are achieved by explicit message passing. Hence, the speed and efficiency of communication are critical to the overall performance of such machines. The goal of this thesis is to reduce communication overhead by supporting message passing in hardware. The first step of this research was to investigate the behavior of parallel application programs running on multicomputers. We developed a performance measurement environment for hypercubes based on software monitoring. By measuring about a dozen realistic hypercube application programs, we found that message destination and length exhibit high temporal and spatial locality, and that a two-stage normal distribution is the most suitable model for the communication and computation workloads. These workload models of realistic parallel programs provide information that is useful in both analytical and experimental studies of multicomputer communication networks. We designed two hardware devices to support communication in both hypercube and mesh networks: a message passing coprocessor (MPC) and a virtual channel router (VCR). The MPC supports software caching, process scheduling, and buffer management. The VCR supports virtual channels and cached circuits. Together, these two devices can reduce message latency by a factor of 5 to 13 in realistic hypercube applications. This performance is confirmed by simulation using both real communication traces and synthetic benchmarks. We also investigated adaptive routing techniques that can cooperate with the VCR to obtain optimal performance in both hypercubes and meshes. By combining adaptive routing with cached circuits and virtual channels, a set of the most frequently used circuits can be maintained in the network, i.e., the network can "adapt" to the applications.
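
The link between destination locality and cached circuits can be illustrated with a minimal sketch. It is not the VCR design: the function name `circuit_cache_hit_rate`, the least-recently-used replacement policy, and the two traces are all assumptions made for illustration, and the traces are invented rather than measured. The sketch only shows why a small set of retained circuits serves a trace with high temporal locality well and a scattered trace poorly.

```python
from collections import OrderedDict


def circuit_cache_hit_rate(destinations, capacity):
    """Return the fraction of messages whose destination already has a
    cached circuit, assuming a hypothetical LRU replacement policy."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()  # destination node -> placeholder for circuit state
    hits = 0
    for dest in destinations:
        if dest in cache:
            hits += 1
            cache.move_to_end(dest)        # mark circuit as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # tear down least recently used circuit
            cache[dest] = True             # set up and retain a new circuit
    return hits / len(destinations) if destinations else 0.0


# Invented traces of destination node numbers (not measured data).
local_trace = [1, 2, 1, 2, 1, 2, 4, 1, 2, 1]      # few destinations, reused often
scattered_trace = [1, 2, 3, 4, 5, 6, 7, 0, 1, 2]  # little reuse

print(circuit_cache_hit_rate(local_trace, 2))      # 0.5
print(circuit_cache_hit_rate(scattered_trace, 2))  # 0.0
```

With two cached circuits, half of the messages in the high-locality trace reuse an existing circuit, while the scattered trace never does. Any real replacement policy, circuit setup cost, or cache size would have to come from the thesis itself.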