Exploring the VLSI scalability of stream processors

Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies in which over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster scaling and intercluster scaling. These techniques are shown to be cost-efficient up to tens of ALUs per cluster and up to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45-nanometer technology. It sustains over 300 GOPS on kernels and provides a 15.3× kernel speedup and an 8.0× application speedup over a 40-ALU stream processor, with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.
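The arithmetic behind these figures can be checked with a short sketch. The function names and the "scaling efficiency" metric (speedup divided by the growth in ALU count) are illustrative assumptions introduced here, not quantities defined in the abstract; only the cluster counts, ALU counts, and speedups are taken from it.

```python
def total_alus(clusters: int, alus_per_cluster: int) -> int:
    """Total ALU count for a clustered organization."""
    return clusters * alus_per_cluster


def scaling_efficiency(speedup: float, scaled_alus: int, baseline_alus: int) -> float:
    """Hypothetical metric: fraction of ideal linear speedup achieved.

    Ideal speedup is assumed to equal the ratio of ALU counts.
    """
    return speedup / (scaled_alus / baseline_alus)


# Figures quoted in the abstract.
scaled = total_alus(clusters=128, alus_per_cluster=5)  # 640-ALU configuration
baseline = 40                                          # 40-ALU reference processor

alu_ratio = scaled / baseline                          # growth in ALU count
kernel_eff = scaling_efficiency(15.3, scaled, baseline)
app_eff = scaling_efficiency(8.0, scaled, baseline)

print(f"ALUs: {scaled}, growth over baseline: {alu_ratio:.0f}x")
print(f"kernel scaling efficiency: {kernel_eff:.1%}")
print(f"application scaling efficiency: {app_eff:.1%}")
```

Under this assumed metric, kernels retain most of the ideal 16× gain while whole applications retain about half of it, which is consistent with the gap the abstract reports between kernel and application speedup.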
