Parallel simulation and multiple-path execution techniques for chip-multiprocessor architectures
Integrating multiple processing elements onto a single integrated circuit to form a chip-multiprocessor (CMP) has been proposed as a solution to the problem of increased wiring delays between elements of an integrated circuit. This dissertation exploits the architecture of a CMP both to reduce the simulation time required to study such chips and to increase the performance of applications running on such a device.
The complexity of parallel systems has increased both the need for comprehensive simulation and the computation time required to perform the simulations. CMP architectures are particularly susceptible to this effect, because simulating them combines the requirements of a microprocessor simulator with those of a parallel system. In the first part of this dissertation, a portable, distributed simulator for CMPs is developed and presented. It is based on the Message Passing Interface (MPI) and is designed to run on a cluster of workstations. Because the simulator itself is a complex application, microbenchmark-based evaluation is used to compare parallelization algorithms and interconnects for use in the parallel simulator and to identify potential bottlenecks. The best combination is shown to yield speedups of up to 16 on a 9-node cluster of dual-CPU workstations.
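To put the reported peak speedup in context, the short sketch below computes the corresponding parallel efficiency. It assumes that all 18 CPUs (9 nodes with 2 CPUs each) take part in the simulation and that the speedup is measured against a single-CPU run; the abstract states neither, and the function name `parallel_efficiency` is purely illustrative.

```python
def parallel_efficiency(speedup: float, nodes: int, cpus_per_node: int) -> float:
    """Return speedup divided by the total number of CPUs.

    Assumption (not stated in the abstract): every CPU in the cluster
    is used, and the baseline is a run on one CPU.
    """
    total_cpus = nodes * cpus_per_node
    if total_cpus <= 0:
        raise ValueError("cluster must contain at least one CPU")
    return speedup / total_cpus


# Figures quoted in the abstract: peak speedup of 16 on 9 dual-CPU nodes.
efficiency = parallel_efficiency(16.0, nodes=9, cpus_per_node=2)
print(f"parallel efficiency: {efficiency:.3f}")
```

Under these assumptions the peak efficiency is 16/18, or roughly 0.89. Because the abstract reports speedups of "up to" 16, this is an upper bound rather than a typical value.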
The tight coupling of processing units in a CMP allows new forms of parallelism to be exploited. The second part of this dissertation studies multiple-path execution (MPE) on a CMP design, which provides speedup on unmodified sequential code by exploring different paths of a conditional branch on separate processors. The impact of processor complexity and count, cache and branch prediction architecture, processor-to-path allocation strategies, and limited interprocessor communication capabilities on MPE performance is explored. Simulation shows a 12.7% speedup in instructions per cycle (IPC) on SPECint95, with up to 33.5% on benchmark components with poor branch prediction accuracy. This level of speedup is achievable on an 8-processor, 8-issue CMP with a simple mesh interconnect that has realistic latencies and limited bandwidth.
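The intuition for why MPE helps most where branch prediction is poor can be illustrated with a deliberately simplified cost model. This is a toy sketch and not the dissertation's simulator: it assumes a fixed cost per branch region, a fixed misprediction penalty, and no overhead for forking paths or for interprocessor communication. All names and parameter values are hypothetical.

```python
def expected_cycles(base_cycles: float, accuracy: float, penalty: float,
                    forked: bool = False) -> float:
    """Expected cycles for one branch region in a toy model.

    Single-path execution pays the misprediction penalty with
    probability (1 - accuracy). When both paths are forked onto
    separate processors, the correct path is always in flight, so the
    penalty is assumed to vanish (fork overhead is ignored here).
    """
    if forked:
        return base_cycles
    return base_cycles + (1.0 - accuracy) * penalty


def toy_mpe_speedup(base_cycles: float, accuracy: float, penalty: float) -> float:
    """Ratio of single-path to forked expected cycles in the toy model."""
    single = expected_cycles(base_cycles, accuracy, penalty)
    forked = expected_cycles(base_cycles, accuracy, penalty, forked=True)
    return single / forked


# Hypothetical parameters: 10 cycles of useful work, 10-cycle penalty.
well_predicted = toy_mpe_speedup(10.0, accuracy=0.95, penalty=10.0)
poorly_predicted = toy_mpe_speedup(10.0, accuracy=0.70, penalty=10.0)
print(f"well predicted:   {well_predicted:.2f}x")
print(f"poorly predicted: {poorly_predicted:.2f}x")
```

In this model the benefit grows as prediction accuracy falls, which matches the qualitative trend reported above. The model says nothing about the magnitudes measured in the dissertation, which depend on the processor, cache, and interconnect factors the study examines.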