Throughput computing

A qualitative change in the scaling of semiconductor technology has ended the performance scaling of the single-thread processors that have been used as the building blocks for high-performance computers for the last decade and has made computers of all scales power limited. In today's power-limited regime, efficient high-performance computers must be built from throughput processors, processors, like GPUs, that are optimized for sustained performance per unit power --- rather than for single-thread performance. This talk will discuss some of the challenges and opportunities in the architecture and programming of future throughput processors. In these processors, performance derives from parallelism and efficiency derives from locality. Parallelism can take advantage of the plentiful and inexpensive arithmetic units in a throughput processor. Without locality, however, bandwidth quickly becomes a bottleneck. Communication bandwidth, not arithmetic is the critical resource in a modern computing system that dominates cost, performance, and power. This talk will discuss exploitation of parallelism and locality with examples drawn from the Imagine and Merrimac projects, from NVIDIA GPUs, and from three generations of stream programming systems.