Stream programming on general-purpose processors

In this paper we investigate mapping stream programs (i.e., programs written in a streaming style for streaming architectures such as Imagine and Raw) onto a general-purpose CPU. We develop and explore a novel way of mapping these programs onto the CPU. We show how the salient features of stream programming such as computation kernels, local memories, and asynchronous bulk memory loads and stores can be easily mapped by a simple compilation system to CPU features such as the processor caches, simultaneous multithreading, and fast inter-thread communication support, resulting in an executable that efficiently uses CPU resources. We present an evaluation of our mapping on a hyper-threaded Intel Pentium 4 CPU as a canonical example of a general-purpose processor. We compare the mapped stream program against the same program coded in a more conventional style for the general-purpose processor. Using both microbenchmarks and scientific applications we show that programs written in a streaming style can run comparably to equivalent programs written in a traditional style. Our results show that coding programs in a streaming style can improve performance on today's machines and smooth the way for significant performance improvements with the deployment of streaming architectures.

[1]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.

[2]  Henry Hoffmann,et al.  A stream compiler for communication-exposed architectures , 2002, ASPLOS X.

[3]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[4]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph..

[5]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[6]  Yannis Kallinderis,et al.  Generic parallel adaptive-grid Navier-Stokes algorithm , 1994 .

[7]  Ken Kennedy,et al.  Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[8]  D. Marr,et al.  Hyper-Threading Technology Architecture and MIcroarchitecture , 2002 .

[9]  Wilson C. Hsieh,et al.  Impulse: Memory system support for scientific applications , 1999, Sci. Program..

[10]  James R. Larus,et al.  Cache-conscious structure layout , 1999, PLDI '99.

[11]  M. Horowitz,et al.  The stream virtual machine , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[12]  Eduard Ayguadé,et al.  A novel renaming mechanism that boosts software prefetching , 2001, ICS '01.

[13]  Monica S. Lam,et al.  Maximizing parallelism and minimizing synchronization with affine transforms , 1997, POPL '97.

[14]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[15]  Christopher Hughes,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.

[16]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[17]  H. Peter Hofstee,et al.  Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.

[18]  Juan J. Alonso,et al.  StreamFLO: an Euler solver for streaming architectures , 2004 .

[19]  Jung Ho Ahn,et al.  Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[20]  Edward A. Lee,et al.  Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing , 1989, IEEE Transactions on Computers.

[21]  James R. Larus,et al.  Making Pointer-Based Data Structures Cache Conscious , 2000, Computer.

[22]  William J. Dally,et al.  Media processing applications on the Imagine stream processor , 2002, Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[23]  Peter Mattson,et al.  A programming system for the imagine media processor , 2002 .

[24]  Tarek A. El-Ghazawi,et al.  On the Enhancements of a Sparse Matrix Information Retrieval Approach , 2000, PDPTA.

[25]  William J. Dally,et al.  The Imagine Stream Processor , 2002, Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[26]  Donald Yeung,et al.  Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[27]  Timothy J. Barth,et al.  Simplified Discontinuous Galerkin Methods for Systems of Conservation Laws with Convex Extension , 2000 .