Special Issue: GPU computing

The combined hurdles of power consumption, limited instruction-level parallelism, and memory latency have led hardware manufacturers to design power-aware multi-core processors and specialized many-core hardware accelerators in order to further exploit the increasing number of transistors dictated by Moore's Law. The race is open: as of today, Intel's top-of-the-line designs include 8 cores in its Xeon Nehalem architecture, AMD raises this number to 12 cores in the Opteron Magny-Cours processor, and the road-maps of the two companies indicate that these counts will increase to 10–12 (Intel) and 16 (AMD) cores in 2011. At the same time, specialized hardware architectures, such as graphics processing units (GPUs), with tens of cores are already widely deployed.

Core counts are only one level of parallelism on the rise. Each core in current processors contains multiple processing elements that enable parallel processing within it. Power efficiency requires that these parallel processing elements be assembled into SIMD (Single Instruction, Multiple Data) units. Current CPU cores support SSE instructions, which enable up to 4 parallel multiply-add operations on single-precision floating-point numbers. This figure will soon rise to 8 with the AVX instruction set, and the Intel Larrabee design already featured 16-wide SIMD units in each core. The number of operations that can be executed in parallel on each GPU core varies widely, between 16 and 80, because manufacturers design their GPU cores differently: the NVIDIA Fermi architecture has up to 16 cores with 32 processing elements each, whereas the AMD Cypress architecture contains up to 20 cores with 80 processing elements each; in both cases this is a 2× increase over the previous generation. However, counting the processing elements on a GPU allows a comparison only within the same GPU family; across families, at least the differing shader clocks and arrangements of the processing elements must be taken into account. (A minimal sketch of 4-wide SSE arithmetic appears below.)

Although these new multi-core and many-core architectures can potentially deliver a revolutionary boost in raw performance, the efficient utilization of the growing SIMD and many-core parallelism is the key that will determine their success or failure. Along these lines, the recent advances in the hardware, functionality, and programmability of GPUs have greatly increased their appeal as add-on co-processors for general-purpose computing. With the involvement of the largest processor manufacturers, NVIDIA, AMD, and Intel, and the strong interest from researchers of various disciplines, this approach has moved from a research niche to a forward-looking technique for heterogeneous parallel computing. Scientific and industrial researchers are constantly finding new applications for GPUs in a wide variety of areas, including image and video processing, molecular dynamics, seismic simulation, computational biology and chemistry, fluid dynamics, weather forecasting, computational finance, quantum physics, and many others.

GPU hardware has evolved over many years from graphics pipelines with many heterogeneous fixed-function components, through partially programmable architectures, toward a more homogeneous general-purpose design (though some fixed-function hardware has remained because of its efficiency). The general-purpose computing on GPU (GPGPU) revolution started with programmable shaders.
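To make the SIMD widths above concrete, here is a minimal C sketch of the 4-wide SSE multiply-add pattern y = a*x + y; the data values are arbitrary and chosen purely for illustration. Note that SSE has no fused multiply-add instruction, so the multiply and add are issued as two vector instructions.

    #include <stdio.h>
    #include <xmmintrin.h>  /* SSE intrinsics: 4-wide single precision */

    int main(void)
    {
        float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };  /* arbitrary example data */
        float y[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
        __m128 va = _mm_set1_ps(2.0f);            /* broadcast the scalar a */
        __m128 vx = _mm_loadu_ps(x);              /* load 4 floats at once */
        __m128 vy = _mm_loadu_ps(y);
        /* y = a * x + y: one vector multiply and one vector add
           carry out 4 multiply-add operations in parallel */
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_storeu_ps(y, vy);
        printf("%.1f %.1f %.1f %.1f\n", y[0], y[1], y[2], y[3]);
        return 0;
    }

With AVX, the same pattern widens to 8 lanes by switching to the 256-bit __m256 type and the corresponding _mm256_* intrinsics.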
NVIDIA's Compute Unified Device Architecture (CUDA) and, to a lesser extent, AMD's CAL/Brook+ have brought GPUs into the mainstream of computing, developing what has recently been coined GPU computing. The great advantage of CUDA is that it defines an abstraction that presents the underlying hardware architecture as a sea of hundreds of fine-grained computational units with synchronization primitives on multiple levels. With OpenCL, there is now also a vendor-independent high-level parallel programming language and an application programming interface that offers the same type of hardware abstraction.
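The following CUDA sketch illustrates this abstraction; the kernel and problem size are our own illustrative choices, not taken from any particular application. Each thread is one fine-grained computational unit, threads within a block cooperate through on-chip shared memory and synchronize with the block-level barrier __syncthreads(), and the host synchronizes with the device as a whole.

    #include <stdio.h>
    #include <stdlib.h>

    #define THREADS 256  /* threads per block */

    /* Each block reduces THREADS input elements to one partial sum;
       __syncthreads() is the block-level synchronization primitive. */
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        __shared__ float cache[THREADS];
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
        cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();  /* every thread of the block reaches the barrier */
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = cache[0];  /* one result per block */
    }

    int main(void)
    {
        const int n = 1 << 20;  /* about one million elements */
        const int blocks = (n + THREADS - 1) / THREADS;
        float *h_in = (float *)malloc(n * sizeof(float));
        float *h_partial = (float *)malloc(blocks * sizeof(float));
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in, *d_partial;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_partial, blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        block_sum<<<blocks, THREADS>>>(d_in, d_partial, n);  /* launch 4096 blocks */
        cudaMemcpy(h_partial, d_partial, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);  /* implicitly waits for the kernel */

        double total = 0.0;
        for (int b = 0; b < blocks; ++b) total += h_partial[b];
        printf("sum = %.0f (expected %d)\n", total, n);

        cudaFree(d_in); cudaFree(d_partial);
        free(h_in); free(h_partial);
        return 0;
    }

OpenCL exposes the same hierarchy under different names: work-items and work-groups take the place of threads and blocks, and barrier(CLK_LOCAL_MEM_FENCE) takes the place of __syncthreads().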