Tradeoff between data-, instruction-, and thread-level parallelism in stream processors

This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware cost, of mechanisms that exploit the different types of parallelism. We show that the hardware overhead of supporting coarse-grained independent threads of control ranges from 15% to 86%, depending on machine parameters. We also demonstrate that, for a set of numerical applications, the performance gains this support provides are smaller in magnitude than its overhead. We argue that for stream applications with scalable parallel algorithms, performance is not very sensitive to the control structures used, across a large range of area-efficient architectural choices. Finally, we evaluate the specific performance effects of scaling along each parallelism dimension and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.

[1]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[2]  James E. Smith,et al.  Complexity-Effective Superscalar Processors , 1997, ISCA.

[3]  Iain E. G. Richardson,et al.  H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia , 2003 .

[4]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[5]  Jung Ho Ahn,et al.  The Design Space of Data-Parallel Memory Systems , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[6]  Mattan Erez,et al.  Merrimac-high-performance and highly-efficient scientific computing with streams , 2006 .

[7]  Shreekant S. Thakkar,et al.  Internet Streaming SIMD Extensions , 1999, Computer.

[8]  Christoforos E. Kozyrakis,et al.  Overcoming the limitations of conventional vector processors , 2003, ISCA '03.

[9]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[10]  Hunter Scales,et al.  AltiVec Extension to PowerPC Accelerates Media Processing , 2000, IEEE Micro.

[11]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[12]  Yale N. Patt,et al.  One Billion Transistors, One Uniprocessor, One Chip , 1997, Computer.

[13]  Anastasis A. Sofokleous,et al.  Review: H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia , 2005, Comput. J..

[14]  Fred Weber,et al.  AMD 3DNow! technology: architecture and implementations , 1999, IEEE Micro.

[15]  William J. Dally,et al.  Exploring the VLSI scalability of stream processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[16]  William J. Dally,et al.  Stream register files with indexed access , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[17]  Vivek Sarkar,et al.  Baring It All to Software: Raw Machines , 1997, Computer.

[18]  William J. Dally,et al.  Imagine: Media Processing with Streams , 2001, IEEE Micro.

[19]  S. Asano,et al.  The design and implementation of a first-generation CELL processor , 2005, ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005..

[20]  Jung Ho Ahn,et al.  Merrimac: Supercomputing with Streams , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[21]  Edward A. Lee,et al.  Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing , 1989, IEEE Transactions on Computers.

[22]  Noah Treuhaft,et al.  Scalable Processors in the Billion-Transistor Era: IRAM , 1997, Computer.

[23]  William J. Dally,et al.  Register organization for media processing , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[24]  Jung Ho Ahn,et al.  Memory and control organizations of stream processors , 2007 .

[25]  William J. Dally,et al.  Analysis and Performance Results of a Molecular Modeling Application on Merrimac , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[26]  J. W. Backus,et al.  Can programming be liberated from the von Neumann style , 1977 .

[27]  Quinn Jacobson,et al.  Trace processors , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[28]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture , 2003, IEEE Micro.

[29]  Christopher Batten,et al.  The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[30]  William J. Dally,et al.  Programmable Stream Processors , 2003, Computer.

[31]  William J. Dally,et al.  Conditional techniques for stream processing kernels , 2004 .

[32]  BackusJohn Can programming be liberated from the von Neumann style , 1978 .

[33]  John W. Backus,et al.  Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs , 1978, CACM.

[34]  Kunle Olukotun,et al.  The Stanford Hydra CMP , 2000, IEEE Micro.

[35]  William J. Dally,et al.  Communication Scheduling , 2000, ASPLOS.

[36]  Luiz André Barroso,et al.  Piranha: a scalable architecture based on single-chip multiprocessing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).