Memory and Control Organizations of Stream Processors

A dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

The increasing importance of numerical applications and the properties of modern VLSI processes have led to a resurgence in the development of architectures with a large number of ALUs, multiple memory channels, and extensive support for parallelism. In particular, stream processors achieve area- and energy-efficient high performance by relying on the abundant parallelism, multiple levels of locality, and predictability of data accesses common to the media, signal processing, and scientific application domains. This thesis explores the memory and control organizations of stream processors. We first study the design space of streaming memory systems in light of the trends in modern DRAMs: increasing concurrency, latency, and sensitivity to access patterns. From a detailed performance analysis using benchmarks with various DRAM parameters and memory-system configurations, we identify read/write turnaround penalties and internal bank conflicts in memory-access threads as the most critical factors affecting performance. We then present hardware techniques developed to maximize sustained memory-system throughput. Because stream processors rely heavily on parallelism for high performance, operations that require serialization can significantly hurt performance. This can be observed in superposition-type updates and histogram computation, which suffer from the memory collision problem: multiple parallel updates target the same memory location. We introduce and detail scatter-add, the data-parallel form of the scalar fetch-and-op, which solves this problem by guaranteeing the atomicity of data accumulation within the memory system. We then explore the scalability of the stream processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop VLSI cost and performance models for a multi-threaded processor in order to study the tradeoffs in functionality and
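The memory collision problem and the semantics of scatter-add mentioned in the abstract can be illustrated with a minimal software sketch. This is only a model of the behavior, not the hardware mechanism developed in the thesis. The function names (`gather_add_scatter`, `scatter_add`, `histogram`) and the sample data are hypothetical and chosen for illustration.

```python
def gather_add_scatter(mem, indices, values):
    """Model a data-parallel read-modify-write with no atomicity.

    All lanes gather their old values first and only then write back,
    so lanes that target the same address overwrite each other's updates.
    This is the memory collision problem.
    """
    out = list(mem)
    loaded = [mem[i] for i in indices]            # gather phase: every lane reads
    for i, old, v in zip(indices, loaded, values):
        out[i] = old + v                          # scatter phase: last writer wins
    return out


def scatter_add(mem, indices, values):
    """Reference semantics of scatter-add.

    Each update is accumulated into its target location as if it were
    atomic, so colliding indices all contribute to the final sum.
    """
    out = list(mem)
    for i, v in zip(indices, values):
        out[i] += v
    return out


def histogram(samples, num_bins):
    """Histogram expressed as a scatter-add of ones into the bins."""
    return scatter_add([0] * num_bins, samples, [1] * len(samples))


# Hypothetical sample data: bin 2 is hit three times, bin 0 twice.
samples = [0, 2, 2, 1, 2, 0]
naive = gather_add_scatter([0] * 3, samples, [1] * len(samples))
correct = histogram(samples, 3)
print(naive)    # colliding updates are lost
print(correct)  # every update is counted
```

In the non-atomic version, updates to a repeated index are lost because each lane adds to the same stale value. The scatter-add version yields the same result as a sequential loop regardless of how many indices collide.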
