Performance of the CRAY T3E Multiprocessor

The CRAY T3E is a scalable shared-memory multiprocessor based on the DEC Alpha 21164 microprocessor. The system includes a number of architectural features designed to tolerate latency and enhance scalability. These include stream buffers, which detect and prefetch down small-stride reference streams; E-registers, which allow memory reference pipelining and provide non-unit-stride access capabilities; and a scalable, high-bandwidth interconnection network. We report our experiences with T3E performance: we describe several hardware features, discuss their programming implications, and provide related benchmark results, including NAS Parallel Benchmark results on up to 1024 processors.