Execution and Cache Performance of the Scheduled Dataflow Architecture

This paper presents an evaluation of our Scheduled Dataflow (SDF) processor. Recent work on new processor architectures has focused mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend improves performance at the expense of increased hardware complexity and relies on a brute-force solution to the memory-wall problem. Our research deviates substantially from this trend by exploring a simpler yet powerful execution paradigm based on dataflow concepts. A program is partitioned into functional execution threads, which are well suited to our non-blocking multithreaded architecture. In addition, all memory accesses are decoupled from thread execution: data is pre-loaded into the thread's context (registers), and all results are post-stored after the thread completes. This decoupling requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. An earlier analytical evaluation of our architecture showed that it could achieve better performance than classical dataflow architectures (e.g., ETS), hybrid models (e.g., EARTH), and decoupled multithreaded architectures (e.g., the Rhamma processor). This paper analyzes the architecture using an instruction-set-level simulator for a variety of benchmark programs. We compared the execution cycles required by programs on SDF with the execution cycles required by the same programs on DLX (or MIPS). We then investigated the expected cache-memory performance by collecting address traces from the programs and feeding them to a trace-driven cache simulator (Dinero-IV). We present these results in this paper.
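
To illustrate what trace-driven cache simulation involves, the following is a minimal sketch in Python. It is not Dinero-IV and does not reproduce the configuration used in this paper; the function name `simulate_direct_mapped`, the direct-mapped organization, and the default cache parameters are assumptions chosen only for illustration. The sketch replays a list of byte addresses against a direct-mapped cache and counts hits and misses.

```python
from typing import Iterable, List, Optional, Tuple


def simulate_direct_mapped(
    trace: Iterable[int],
    num_lines: int = 4,      # hypothetical cache size, in lines
    block_size: int = 16,    # hypothetical block size, in bytes
) -> Tuple[int, int]:
    """Replay a byte-address trace on a direct-mapped cache.

    Returns (hits, misses). Illustrative only: no write policy,
    no associativity, no timing model.
    """
    tags: List[Optional[int]] = [None] * num_lines  # one stored tag per line
    hits = 0
    misses = 0
    for addr in trace:
        block = addr // block_size   # block number containing this address
        index = block % num_lines    # cache line the block maps to
        tag = block // num_lines     # identifies which block occupies the line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag        # fill or replace the line on a miss
    return hits, misses


if __name__ == "__main__":
    # Made-up trace: addresses 0 and 64 map to the same line and conflict.
    example_trace = [0, 4, 8, 64, 0, 16]
    print(simulate_direct_mapped(example_trace))
```

A production simulator additionally models associativity, replacement and write policies, and separate instruction and data caches; the sketch only shows the basic replay loop over an address trace.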
