Paper information - Data-Driven Concurrency for High Performance Computing


In this work, we use dynamic dataflow (data-driven) techniques to improve the performance of high performance computing (HPC) systems. The techniques are implemented and evaluated in an efficient, portable, and robust programming framework that enables data-driven concurrency on HPC systems. The framework is based on data-driven multithreading (DDM), a hybrid control-flow/dataflow model that schedules threads based on data availability on sequential processors. We evaluated the framework using several benchmarks with different characteristics on two systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The evaluation shows that the framework scales well and effectively tolerates scheduling overheads and memory latencies. We also compare the framework to MPI, DDM-VM, and OmpSs@Cluster; it obtains comparable or better performance.
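The idea of scheduling threads based on data availability can be illustrated with a small single-process sketch. This is not the framework's API: the function `run_data_driven`, the `tasks`/`deps` dictionaries, and the ready-count bookkeeping are hypothetical names chosen for illustration, and the sketch runs tasks sequentially rather than on real cores. It only shows the general principle that a task becomes runnable once all of its producers have finished.

```python
from collections import deque


def run_data_driven(tasks, deps):
    """Run tasks in an order determined by data availability.

    tasks: maps a task name to a callable (hypothetical example format).
    deps:  maps a task name to the list of producer tasks whose results
           it consumes, passed to the callable in that order.
    """
    # Ready count = number of producers that have not yet completed.
    ready_count = {name: len(deps.get(name, [])) for name in tasks}

    # Invert the dependency map so each producer knows its consumers.
    consumers = {name: [] for name in tasks}
    for name, producers in deps.items():
        for producer in producers:
            consumers[producer].append(name)

    # Tasks with no pending inputs can be scheduled immediately.
    ready = deque(name for name in tasks if ready_count[name] == 0)
    results, order = {}, []

    while ready:
        name = ready.popleft()
        inputs = [results[p] for p in deps.get(name, [])]
        results[name] = tasks[name](*inputs)
        order.append(name)
        # Completing a task decrements the ready count of each consumer;
        # a consumer is scheduled when its count reaches zero.
        for consumer in consumers[name]:
            ready_count[consumer] -= 1
            if ready_count[consumer] == 0:
                ready.append(consumer)

    return results, order


# Illustrative dependency graph: c needs a and b, d needs c.
tasks = {
    "a": lambda: 2,
    "b": lambda: 3,
    "c": lambda x, y: x + y,
    "d": lambda z: z * 10,
}
deps = {"c": ["a", "b"], "d": ["c"]}

results, order = run_data_driven(tasks, deps)
print(results["d"], order)
```

In this sketch, `a` and `b` have no inputs and run first; `c` is released only after both have produced their values, and `d` only after `c`. A real data-driven runtime would dispatch ready tasks to multiple cores or nodes instead of a single queue.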

Paraskevas Evripidou | George Matheou
