Scatter-add in data parallel architectures

Many important applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations require costly serialization that increases the run time. Examples include superposition-type updates in scientific computing and histogram computations in media processing. We introduce scatter-add, the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream-style memory systems. The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to its referenced memory location instead of overwriting it. This novel architecture extension allows us to efficiently support the data-parallel atomic update computations found in parallel programming languages such as HPF, and it applies to both single-processor and multiprocessor SIMD data-parallel systems. We detail the microarchitecture of a scatter-add implementation on a stream architecture, which requires less than a 2% increase in die area yet shows speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
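The semantics described above can be illustrated with a minimal software sketch. It models only the sequential reference behavior of scatter-add, not the hardware mechanism or its atomicity; the names `scatter_add` and `histogram` are hypothetical and chosen for illustration.

```python
def scatter_add(memory, indices, values):
    """Sequential reference semantics: memory[indices[i]] += values[i].

    A plain scatter would overwrite memory[indices[i]]; scatter-add
    accumulates instead, so repeated indices combine rather than collide.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for idx, val in zip(indices, values):
        memory[idx] += val
    return memory


def histogram(samples, num_bins):
    """Histogram as a scatter-add of the constant 1 into per-bin counters."""
    bins = [0] * num_bins
    return scatter_add(bins, samples, [1] * len(samples))


# Index 1 appears twice, so both contributions (5 and 2) are accumulated.
mem = scatter_add([10, 20, 30, 40], [1, 3, 1], [5, 7, 2])
print(mem)

# Bin indices are assumed to be precomputed integers in [0, num_bins).
print(histogram([0, 2, 2, 1, 2], 3))
```

With a plain scatter, the two writes to index 1 would collide and only one of the values would survive. Because addition is commutative and associative, the accumulated result does not depend on the order in which the updates are applied, and that is why the operation can be performed in parallel without serializing the whole update.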
