Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. It introduces an expressive task graph programming model that assists developers in implementing parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph parallelism with in-graph control flow, enabling end-to-end parallel optimization. To support our model with high performance, we design an efficient system runtime that solves many of the new scheduling challenges arising from our model and optimizes performance across latency, energy efficiency, and throughput. We have demonstrated the promising performance of Taskflow in real-world applications. For example, Taskflow solves a large-scale machine learning workload up to 29% faster, with 1.5× less memory and 1.9× higher throughput, than the industrial system oneTBB on a machine with 40 CPUs and 4 GPUs. We have open-sourced Taskflow and deployed it to a large number of users in the open-source community.
