Architectural Support for the Stream Execution Model on General-Purpose Processors

Stream processing has recently attracted considerable interest in both industry (e.g., Cell, NVIDIA G80, ATI R580) and academia (e.g., Stanford Merrimac, MIT RAW), and stream programs are becoming increasingly popular for both media and more general-purpose computing. Targeting these stream architectures requires a special style of programming, called stream programming, but it can yield large performance benefits. In this paper, we add a minimal set of architectural features to commodity general-purpose processors (e.g., those from Intel and AMD) to efficiently support the stream execution model. We design the extensions to reuse existing components of the general-purpose processor hardware as much as possible, investigating low-cost modifications to the CPU caches, the hardware prefetcher, and the execution core. With less than a 1% increase in die area and judicious use of a software runtime system, we can efficiently support stream programming on traditional processor cores. We evaluate our techniques by running scientific applications on a cycle-level simulation system. The results show that our system executes stream programs as efficiently as possible, limited only by ALU performance and the memory bandwidth needed to feed the ALUs.
