Modeling instruction placement on a spatial architecture

In response to current technology scaling trends, architects are developing a new style of processor known as a spatial computer. A spatial computer is composed of hundreds or even thousands of simple, replicated processing elements (PEs), frequently organized into a grid. Several current spatial computers, such as TRIPS, RAW, SmartMemories, nanoFabrics, and WaveScalar, explicitly place a program's instructions onto the grid. Designing instruction placement algorithms is an enormous challenge: the number of possible mappings of instructions to PEs grows exponentially with application size, and the choice of mapping strongly affects program performance. In this paper we develop a performance model that can guide instruction placement. The model comprises three components, each of which captures a different aspect of spatial computing performance: inter-instruction operand latency, data cache coherence overhead, and contention for processing element resources. We evaluate the model on one spatial computer, WaveScalar, and find that predicted and actual performance correlate with a coefficient of -0.90. We demonstrate the model's utility by using it to design a new placement algorithm, which outperforms our previous algorithms. Although developed in the context of WaveScalar, the model can serve as a foundation for tuning code, compiling software, and understanding the microarchitectural trade-offs of spatial computers in general.
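To make the three-component structure concrete, the following is a minimal illustrative sketch of a placement cost function of the kind the abstract describes. All function names, weights, and formulas here are assumptions for illustration, not the paper's actual model: operand latency is proxied by traffic-weighted grid distance, coherence overhead by how many distinct PEs hold memory instructions, and contention by pairwise competition among instructions sharing a PE.

```python
# Hypothetical sketch of a spatial-placement cost model with three
# components: operand latency, cache coherence overhead, and PE
# contention. Weights and formulas are illustrative assumptions.
from collections import Counter


def manhattan(a, b):
    # Grid (hop-count) distance between two PE coordinates.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def predicted_cost(placement, dataflow_edges, mem_instrs,
                   w_lat=1.0, w_coh=1.0, w_cont=1.0):
    """placement: instruction -> (row, col) PE coordinate.
    dataflow_edges: (producer, consumer, operand_count) triples.
    mem_instrs: instructions that access the data cache."""
    # 1. Inter-instruction operand latency: operands pay one unit of
    #    latency per grid hop between producer and consumer.
    latency = sum(n * manhattan(placement[p], placement[c])
                  for p, c, n in dataflow_edges)
    # 2. Cache coherence overhead (crude proxy): memory instructions
    #    scattered across more PEs generate more coherence traffic.
    coherence = len({placement[i] for i in mem_instrs})
    # 3. Contention: instructions mapped to the same PE compete for its
    #    issue slots and network ports, roughly pairwise.
    load = Counter(placement.values())
    contention = sum(k * (k - 1) for k in load.values())
    return w_lat * latency + w_coh * coherence + w_cont * contention
```

A placement algorithm could use such a function greedily, evaluating a handful of candidate PEs for each instruction and keeping the one that minimizes the predicted cost.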
