Hardware design of task superscalar architecture

Exploiting concurrency to achieve greater performance is a difficult and important challenge for current high-performance systems. Although the theory is plain, the complexity of traditional parallel programming models in most cases prevents programmers from harvesting the available performance. Several partitioning strategies have been proposed to better exploit concurrency at task granularity. In this sense, different dynamic software task management systems, such as task-based dataflow programming models, apply dataflow principles to improve task-level parallelism and overcome the limitations of static task management systems. These models implicitly schedule computation and data and use tasks instead of instructions as the basic work unit, thereby relieving the programmer of explicitly managing parallelism. While these programming models share conceptual similarities with the well-known Out-of-Order superscalar pipelines (e.g., dynamic data dependency analysis and dataflow scheduling), they rely on software-based dependency analysis, which is inherently slow and limits their scalability when tasks are fine-grained and numerous. This problem grows with the number of available cores: to keep all the cores busy and accelerate the overall application, the application must be partitioned into more and smaller tasks. Task scheduling in software (i.e., creating tasks and managing their execution) introduces overheads, and so becomes increasingly inefficient as the number of cores grows. In contrast, a hardware scheduling solution can achieve greater speed-ups, as a hardware task scheduler requires fewer cycles than its software counterpart to dispatch a task. The Task Superscalar is a hybrid dataflow/von-Neumann architecture that exploits the task-level parallelism of the program.
The Task Superscalar combines the effectiveness of Out-of-Order processors with the task abstraction, and thereby provides a unified management layer for CMPs that effectively employs processors as functional units. The Task Superscalar has been implemented in software, with limited parallelism and high memory consumption due to the nature of the software implementation. In this thesis, a Hardware Task Superscalar architecture is designed to be integrated in a future High Performance Computer with the ability to exploit fine-grained task parallelism. The main contributions of this thesis are: (1) a design of the operational flow of the Task Superscalar architecture, adapted and improved for hardware implementation, (2) an HDL prototype for latency exploration, (3) a full cycle-accurate simulator of the Hardware Task Superscalar (based on the previously obtained latencies), (4) a full design space exploration of the Task Superscalar component configuration (number and size) for systems with different numbers of processing elements (cores), (5) a comparison with a software implementation of a real task-based programming model runtime using real benchmarks, and (6) a hardware resource usage exploration of the selected configurations.
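
To make the idea of dynamic task-level dependency analysis concrete, the following is a minimal software sketch, not the thesis's hardware design. It assumes each task declares the data it reads and writes, derives read-after-write, write-after-write, and write-after-read dependencies in submission order, and then groups tasks into waves that could run concurrently. The names `TaskGraph`, `submit`, and `waves` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TaskGraph:
    """Toy dependency tracker (illustrative assumption, not the thesis design)."""

    def __init__(self):
        self.last_writer = {}              # datum -> id of the most recent writer
        self.readers = defaultdict(list)   # datum -> readers since the last write
        self.deps = {}                     # task id -> set of predecessor task ids
        self.order = []                    # task ids in submission order

    def submit(self, tid, reads=(), writes=()):
        """Register a task and compute its dependencies on earlier tasks."""
        preds = set()
        for d in reads:
            if d in self.last_writer:
                preds.add(self.last_writer[d])   # read-after-write
        for d in writes:
            if d in self.last_writer:
                preds.add(self.last_writer[d])   # write-after-write
            preds.update(self.readers[d])        # write-after-read
        # Only now record this task's own accesses.
        for d in reads:
            self.readers[d].append(tid)
        for d in writes:
            self.last_writer[d] = tid
            self.readers[d] = []
        self.deps[tid] = preds
        self.order.append(tid)

    def waves(self):
        """Group tasks into waves whose members have all predecessors done."""
        remaining = {t: set(p) for t, p in self.deps.items()}
        done, result = set(), []
        while remaining:
            ready = [t for t in self.order
                     if t in remaining and remaining[t] <= done]
            result.append(ready)
            done.update(ready)
            for t in ready:
                del remaining[t]
        return result


# Hypothetical five-task program over data items a, b, c, d.
g = TaskGraph()
g.submit("t1", writes=["a"])
g.submit("t2", writes=["b"])
g.submit("t3", reads=["a", "b"], writes=["c"])
g.submit("t4", reads=["a"], writes=["d"])
g.submit("t5", reads=["c", "d"], writes=["a"])
print(g.waves())
```

Every `submit` call walks and updates shared bookkeeping structures, which illustrates why this analysis becomes costly in software as tasks get smaller and more numerous.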
