Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation

In this paper, the scheduled dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against a superscalar architecture. Recent work on new processor architectures has focused mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend yields better performance, but at the expense of increased hardware complexity and, possibly, higher power consumption resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful, execution paradigm based on dataflow and multithreading. A program is partitioned into nonblocking execution threads, and all memory accesses are decoupled from a thread's execution: data is preloaded into the thread's context (registers), and all results are poststored after the thread completes execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, and to eliminate unnecessary dependencies among instructions. We compare the execution cycles required by programs on SDF with those required on SimpleScalar (a superscalar simulator), considering the essential aspects of both architectures in order to make the comparison fair. The results show that the SDF architecture can outperform the superscalar architecture. SDF performance also scales better with the number of functional units and allows good exploitation of thread-level parallelism (TLP) and available chip area.
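The preload/execute/poststore discipline described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the SDF hardware or its instruction set: the names `SDFThread`, `run`, and `demo` are hypothetical, memory is modeled as a dictionary, and a thread is assumed to become runnable once all of its input locations hold values (one plausible reading of "nonblocking").

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Memory = Dict[str, int]
Registers = Dict[str, int]


@dataclass
class SDFThread:
    """Hypothetical model of one nonblocking thread (illustrative names only)."""
    inputs: List[str]    # memory locations copied into registers by preload
    outputs: List[str]   # memory locations written by poststore
    body: Callable[[Registers], Registers]  # computation that touches registers only
    registers: Registers = field(default_factory=dict)
    results: Registers = field(default_factory=dict)

    def preload(self, memory: Memory) -> None:
        # All memory reads happen here, before execution starts.
        self.registers = {name: memory[name] for name in self.inputs}

    def execute(self) -> None:
        # No memory access: the body sees only the preloaded context.
        self.results = self.body(self.registers)

    def poststore(self, memory: Memory) -> None:
        # All memory writes happen here, after execution completes.
        for name in self.outputs:
            memory[name] = self.results[name]


def run(threads: List[SDFThread], memory: Memory) -> List[str]:
    """Run each thread once its inputs exist (assumed enabling rule)."""
    trace: List[str] = []
    pending = list(threads)
    while pending:
        ready = [t for t in pending if all(n in memory for n in t.inputs)]
        if not ready:
            raise RuntimeError("no thread has all of its inputs available")
        for t in ready:
            t.preload(memory)
            t.execute()
            t.poststore(memory)
            trace.extend(["preload", "execute", "poststore"])
            pending.remove(t)
    return trace


def demo() -> Memory:
    memory: Memory = {"a": 2, "b": 3}
    add = SDFThread(["a", "b"], ["s"], lambda r: {"s": r["a"] + r["b"]})
    double = SDFThread(["s"], ["d"], lambda r: {"d": r["s"] * 2})
    # Listing order does not matter: 'double' waits until 's' has been poststored.
    run([double, add], memory)
    return memory
```

In this model a thread never stalls on memory while executing, because every operand is already in its register context. The sketch runs the three phases sequentially and so does not model the decoupling itself, that is, any overlap between one thread's memory phases and another thread's execution.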
