Performance and Programming Experience on the Tera MTA

The Tera MTA (for \Multithreaded Architecture") computer features a radically new architecture, with hardware support for up to 128 threads per processor, a powerful instruction set, nearly uniform access time to all memory locations, and zero-cost synchronization and swapping between threads of control. Memory access latencies are tolerated by swapping between the threads. Given a multithreaded program with suucient parallelism, the scalable memory system should allow uncommonly good scaling to multiple processors. This paper gives a brief description of the MTA's architecture and a few observations about its programmability, and then presents some performance gures. 1 The Tera MTA Each processor of the Tera MTA has 128 streams, where a stream is hardware that includes a program counter and set of 32 registers. Each stream can be assigned to (at most) one program thread. 1 A stream can issue one instruction, but then must wait at least 21 cycles (the length of the instruction pipeline) before issuing another. However, instructions from diierent streams on the same processor can be pipelined. That is, each cycle the processor selects (fairly) one of the stream that is ready, and issues the next instruction for the thread assigned to that stream. If there are no ready streams, it issues a no-op (called a phantom). The Tera MTA uses a 128-bit VLIW instruction architecture, where each instruction can include one memory operation (either a Load or a Store) plus two other operations. The MTA has no data caches; instead, all memory references proceed through a 3-D torroidal network to the appropriate memory module and back to the issuing processor. This roundtrip that might require 150 or even more cycles. The network and memory system are designed to sustain a throughput of one memory reference per processor per cycle. Given the relatively long latency for each memory request, there will be many memory references \in ight" at any instant of time. Each stream is allowed to have up to eight outstanding memory references. If an application has enough instruction-level parallelism so that no memory reference is needed until eight instructions later, then just 21 stream, each executing an independent thread of instructions, provides the ability to tolerate 21 8 = 168 cycles of memory latency. When there is less ILP available, or the threads interact, more threads may be necessary to keep the processor busy. 1 A thread is a sequence of instruction. Threads are like …
