High Performance Architecture using Speculative Threads and Dynamic Memory Management Hardware

Advances in very large scale integration (VLSI) technology allow hundreds of billions of transistors to be packed onto a single chip. With this larger hardware budget, how best to exploit the available hardware resources has become an important research area. Some researchers have shifted from control-flow von Neumann architectures back to dataflow architectures in order to explore scalable designs leading to multi-core systems with several hundred processing elements. In this dissertation, I address how the performance of modern processing systems can be improved while attempting to reduce hardware complexity and energy consumption. The research described here tackles both central processing unit (CPU) performance and memory subsystem performance. More specifically, I describe the design of an innovative decoupled multithreaded architecture that can be used in multi-core processor implementations. I also address how memory management functions can be off-loaded from the processing pipelines to further improve system performance and to eliminate the cache pollution caused by runtime management functions.
