Paper information - An architectural evaluation of parallel systems extended with reconfigurable hardware

The availability of high-density FPGA chips and breakthroughs in software compilation techniques enable a new class of parallel systems with user-reconfigurable hardware. One early example is the SRC-6 system, a cluster-based parallel machine that connects a set of reconfigurable hardware units and external shared memory through proprietary crossbar switches. This dissertation investigates the performance improvement that reconfigurable hardware provides for the communication and computation operations of parallel applications. On the SRC-6 system, the reconfigurable hardware units can access the external shared memory through direct memory access engines, which provides an opportunity to investigate the effectiveness of leveraging additional computational resources for collective communication. We implement the Reduce operation of the Message Passing Interface (MPI) standard on the SRC-6 system, using the reconfigurable hardware units to perform combining while the data is communicated via external shared memory. Results show that this implementation can outperform the processor-based solution, but only with large problem sizes. Moreover, the use of external shared memory units enables the reconfigurable hardware implementation to tolerate application load imbalance through dynamic re-ordering of the combining process. Results show that this implementation can tolerate entry load imbalance up to the duration of a load-balanced Reduce operation. Applications can also obtain performance improvement by off-loading computation kernels that are characterized by intensive mathematical operations and regular control structure to the numerous function units in the reconfigurable hardware. We implement the general matrix multiplication routine (GEMM) of the Basic Linear Algebra Subprograms (BLAS) suite using the reconfigurable hardware on the SRC-6 system. Results show that the pipelined implementation of the GEMM operation can achieve 11.2 GFLOPS at 100 MHz with an 8192 x 8192 problem size using a single reconfigurable hardware unit. We present a detailed account of architectural factors that affect the performance of the GEMM implementation.
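
The idea of tolerating load imbalance by re-ordering the combining step of Reduce can be illustrated with a minimal Python sketch. It is not the dissertation's SRC-6 implementation: the function names, the eight-rank data set, and the shuffled arrival order are all hypothetical, chosen only to show that an associative and commutative operator gives the same result whether contributions are combined in rank order or in the order they happen to arrive.

```python
import operator
import random
from functools import reduce


def reduce_in_rank_order(contributions, op):
    """Combine contributions strictly in rank order 0, 1, 2, ..."""
    return reduce(op, (contributions[rank] for rank in sorted(contributions)))


def reduce_in_arrival_order(contributions, arrival_order, op):
    """Combine each contribution as soon as it arrives (hypothetical model)."""
    accumulator = None
    for rank in arrival_order:
        value = contributions[rank]
        accumulator = value if accumulator is None else op(accumulator, value)
    return accumulator


# Hypothetical data: 8 ranks, each contributing one integer.
contributions = {rank: (rank + 1) * 10 for rank in range(8)}

# Simulate load imbalance: ranks enter the Reduce in a shuffled order.
arrival_order = list(contributions)
random.Random(0).shuffle(arrival_order)

in_order = reduce_in_rank_order(contributions, operator.add)
reordered = reduce_in_arrival_order(contributions, arrival_order, operator.add)
```

Integers are used here on purpose. With floating-point addition, a different combining order can change the rounding of the result, so a re-ordering scheme is exact only for operators that are truly associative and commutative.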

David E. Culler | Frederick C. Wong