Implementation and performance evaluation of Scheduled Dataflow (SDF) architecture

This dissertation presents the implementation (in simulation) and evaluation of a non-blocking, multithreaded architecture with decoupled memory access and execution, known as the Scheduled Dataflow (SDF) architecture. Recent work on new processor architectures has focused mainly on Very Long Instruction Word (VLIW) (e.g., Itanium), superscalar, and superspeculative designs. This trend yields better performance at the expense of increased hardware complexity, and possibly higher power consumption resulting from dynamic instruction scheduling. The SDF system deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow, multithreading, and decoupling of memory accesses from execution. A program is partitioned into non-blocking execution threads. In addition, all memory accesses are decoupled from the thread's execution. Data is pre-loaded into the thread's context (registers), and all results are post-stored after the completion of the thread's execution. The decoupling of memory accesses from thread execution requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. Thus, SDF contains two units: the Synchronization Processor (SP) and the Execution Processor (EP). Even though multithreading and decoupling are possible with a control-flow architecture, the non-blocking and functional nature of the SDF system makes it easier to coordinate the memory accesses and execution of a thread, and eliminates unnecessary dependencies among instructions. The evaluation compares the execution cycles of SDF with those of a MIPS architecture (DLX simulator). The SDF simulator can also be easily modified to contain more than a single SP and a single EP.
The execution cycles on superscalar (SimpleScalar simulator) and VLIW (as facilitated by the Trimaran simulator and TMSC6000) architectures are compared with those of an SDF system consisting of multiple SPs and EPs. Our performance comparisons show that the SDF system consistently outperforms a MIPS-like system. The SDF system also outperforms superscalar and VLIW when the number of functional units (viz., integer and floating point units, or EPs and SPs) exceeds a certain number. The SDF system's performance improvements result from multithreading and decoupling. This dissertation relies on an instruction set simulator for the SDF system and hand-coded benchmarks.
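
The three-phase thread life cycle described above (pre-load, execute, post-store) can be illustrated with a minimal Python sketch. It is not the dissertation's simulator: the names `SDFThread` and `run`, the queue structure, and the one-action-per-step timing are all assumptions made for illustration, and the threads are assumed to be independent and already enabled. The point is only that the thread body touches registers alone, while a separate SP role performs every memory access.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SDFThread:
    """Hypothetical non-blocking thread (illustrative, not the SDF ISA)."""
    loads: Dict[str, int]            # register name -> address to pre-load
    body: Callable[[dict], dict]     # pure function of registers, no memory access
    stores: Dict[str, int]           # result register name -> address to post-store
    regs: dict = field(default_factory=dict)


def run(threads, memory):
    """Alternate one EP action and one SP action per step until all threads finish."""
    preload_q = deque(threads)       # waiting for the SP to pre-load
    exec_q = deque()                 # context ready, waiting for the EP
    poststore_q = deque()            # finished, waiting for the SP to post-store
    while preload_q or exec_q or poststore_q:
        finished = None
        # EP role: run one ready thread to completion using registers only.
        if exec_q:
            finished = exec_q.popleft()
            finished.regs.update(finished.body(finished.regs))
        # SP role: post-store a finished thread, otherwise pre-load a new one.
        if poststore_q:
            t = poststore_q.popleft()
            for reg, addr in t.stores.items():
                memory[addr] = t.regs[reg]
        elif preload_q:
            t = preload_q.popleft()
            t.regs = {reg: memory[addr] for reg, addr in t.loads.items()}
            exec_q.append(t)
        if finished is not None:
            poststore_q.append(finished)
    return memory


def demo():
    """Two independent threads: a sum and a doubling."""
    memory = {0: 3, 1: 4, 2: 0, 3: 5, 4: 0}
    add = SDFThread({"a": 0, "b": 1}, lambda r: {"s": r["a"] + r["b"]}, {"s": 2})
    dbl = SDFThread({"x": 3}, lambda r: {"y": r["x"] * 2}, {"y": 4})
    return run([add, dbl], memory)
```

In this sketch, while the EP role executes the first thread, the SP role pre-loads the second, which is the overlap that decoupling is meant to expose. A fuller model would also need the synchronization that enables a thread once its inputs are available, which is omitted here.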
