A pattern for efficient parallel computation on multicore processors with scalar operand networks

Systolic arrays have long been used to develop custom hardware because they result in designs that are efficient and scalable. Many researchers have explored ways to exploit systolic designs in programmable processors; however, such efforts often result in the simulation of large systolic arrays on general-purpose platforms. While simulation can add flexibility and problem-size independence, it greatly reduces the efficiency of the original systolic approach. This paper presents a pattern for developing parallel programs that use systolic designs to execute efficiently (without resorting to simulation) on modern multicore processors featuring scalar operand networks. The pattern is a compromise that can achieve both high efficiency and flexibility, given appropriate hardware support. Several examples illustrate its application to produce parallel implementations of matrix multiplication and convolution.
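For readers unfamiliar with systolic designs, the data flow behind a systolic matrix multiplication can be sketched as below. This is only an illustration under stated assumptions, not the pattern the paper presents: it is a sequential Python simulation of an output-stationary array (precisely the kind of simulation the abstract describes as inefficient), and the function name `systolic_matmul` and the register layout are hypothetical choices made for this sketch. Each processing element PE(i, j) accumulates one output entry while operands from A move right and operands from B move down, one hop per step.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A * B.

    Illustrative sketch only: names and layout are assumptions,
    not taken from the paper.
    """
    n, m, p = len(A), len(B), len(B[0])
    # Operand registers held by each processing element PE(i, j).
    a_reg = [[0] * p for _ in range(n)]
    b_reg = [[0] * p for _ in range(n)]
    C = [[0] * p for _ in range(n)]

    # The last operand pair reaches PE(n-1, p-1) at step m + n + p - 3.
    for t in range(m + n + p - 2):
        new_a = [[0] * p for _ in range(n)]
        new_b = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:
                    # Row i of A enters from the left edge, delayed by i steps.
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < m else 0
                else:
                    # Otherwise take the value the left neighbor held last step.
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    # Column j of B enters from the top edge, delayed by j steps.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < m else 0
                else:
                    # Otherwise take the value the upper neighbor held last step.
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        # Every PE performs one multiply-accumulate per step.
        for i in range(n):
            for j in range(p):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because of the input skew, PE(i, j) sees the operand pair A[i][k] and B[k][j] at step t = i + j + k, so each element only ever communicates with its immediate neighbors. That nearest-neighbor, register-to-register traffic is the property that makes such designs a natural fit for hardware with low-latency operand networks.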
