Massive data sets have radically changed our understanding of how to design efficient algorithms; the streaming paradigm, whether in terms of the number of passes of an external memory algorithm or the single pass and limited memory of a stream algorithm, appears to be the dominant method for coping with large data.

A very different kind of massive computation has had the same effect at the level of the CPU. It has long been observed [Backus 1977] that the traditional von Neumann-style architecture creates memory bottlenecks between the CPU and main memory, and much of chip design over the past several years has focused on alleviating this bottleneck by way of fast memory, caches, prefetching strategies, and the like. However, all of this has made the memory bottleneck itself the focus of chip optimization efforts, a fact reflected in the amount of real estate on a chip devoted to caching and memory access circuitry as compared to the ALU itself [Duca et al. 2003]. For compute-intensive operations, this is an unacceptable tradeoff.

The most prominent example is the computation performed by a graphics card. The operations themselves are very simple and require very little memory, but they must be performed extremely fast and in parallel to whatever degree possible. Inspired in part by dataflow architectures and systolic arrays, the development of graphics chips has focused on high computational throughput while sacrificing (to a degree) the generality of a CPU. The result is a stream processor that is highly optimized for stream computations. Today's GPUs (graphics processing units) can process over 50 million triangles and 4 billion pixels per second. Their "Moore's Law" is faster than that of CPUs, owing primarily to their stream architecture, which allows all additional transistors to be devoted directly to increasing computational power.

An intriguing side effect of this is the growing use of graphics cards as general-purpose stream processing engines. In an ever-increasing array of applications, researchers are discovering that performing a computation on a graphics card is far faster than performing it on the CPU, and so are using the GPU as a stream co-processor. Another feature that makes the graphics pipeline attractive (and distinguishes it from other stream architectures) is the spatial parallelism it provides: conceptually, each pixel on the screen can be viewed as a stream processor, potentially giving a large degree of parallelism.
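To make the per-pixel parallelism concrete, the following sketch is a minimal modern analogue (not part of the original text) written in CUDA rather than the fragment-program interfaces of the era; the kernel name saxpy_pixel and the constants are illustrative assumptions. Each thread plays the role of one pixel-sized stream processor, applying the same small operation independently to one element of the input stream.

// Minimal CUDA sketch: one thread per "pixel", each applying the same
// simple per-element operation to the input stream. Illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy_pixel(const float* in, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // pixel / stream index
    if (i < n) {
        out[i] = a * in[i] + 1.0f;   // tiny per-element computation
    }
}

int main() {
    const int n = 1 << 20;                          // one million stream elements
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // Launch one thread per element: the entire stream is processed in parallel,
    // mirroring the way a fragment program runs at every pixel location.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy_pixel<<<blocks, threads>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);              // expect 2*42 + 1 = 85
    cudaFree(in);
    cudaFree(out);
    return 0;
}

The point of the sketch is only the execution model: the same short program is applied independently at every stream position, which is what lets the hardware devote its transistors to arithmetic rather than to caching and memory-access machinery.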
[1] J. W. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun. ACM, 1978.
[2] D. S. Fussell et al. On the power of the frame buffer. ACM Trans. Graph., 1988.
[3] V. M. Bove et al. Cheops: a reconfigurable data-flow system for video processing. IEEE Trans. Circuits Syst. Video Technol., 1995.
[4] D. Manocha et al. Fast computation of generalized Voronoi diagrams using graphics hardware. SIGGRAPH, 1999.
[5] M. Viswanathan et al. An Approximate L1-Difference Algorithm for Massive Data Streams. SIAM J. Comput., 2002.
[6] P. Hanrahan et al. Efficient partitioning of fragment shaders for multipass rendering on programmable graphics hardware. HWWS '02, 2002.
[7] W. J. Dally et al. The Imagine Stream Processor. Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2002.
[8] D. Agrawal et al. Hardware acceleration for spatial selections and joins. SIGMOD '03, 2003.
[9] K. Munagala et al. The Power of a Two-sided Depth Test and its Application to CSG Rendering and Depth Extraction. 2003.
[10] P. Hanrahan et al. Data Parallel Computation on Graphics Hardware. 2003.