Accelerating relax-ordered task-parallel workloads using multi-level dependency checking

Work-efficient task-parallel algorithms enforce ordered execution of tasks using priority schedulers. These algorithms suffer from limited parallelism due to data movement and synchronization bottlenecks. State-of-the-art priority schedulers relax the ordering of tasks to avoid the false dependencies created by strict queuing constraints, thus unlocking task parallelism. However, relaxing task dependencies causes races on shared data among cores, which lead to redundant task computations in concurrently executing threads. Although static algorithm optimizations have been shown to reduce redundant work, they do not exploit the tradeoff between parallelism and work efficiency that is exposed only at runtime. This paper proposes a task dependency checking mechanism that dynamically tracks the monotonic property of parent-child relationships across multiple levels from any given task. Since shared memory writes are known to be slower than concurrent reads, the multi-level checks effectively detect task dependency races and prune redundant tasks. Evaluation of relax-ordered algorithms on a 40-core Intel Xeon multicore shows an average performance improvement of 44% over the Galois obim scheduler.
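To make the idea of multi-level dependency checking concrete, the following is a minimal sketch and not the paper's implementation. It assumes single-source shortest paths with non-negative weights as the relax-ordered workload, uses a sequential FIFO worklist to stand in for a relaxed scheduler, and interprets the "monotonic property" as "a task is redundant if its own distance, or the distance of any tracked ancestor, has been improved since the task was created". The function name `relaxed_sssp`, the `levels` parameter, and the example graph are all hypothetical.

```python
import math
from collections import deque


def relaxed_sssp(graph, source, levels=1):
    """Shortest paths with a relaxed (FIFO) task order and ancestor checks.

    graph  : dict mapping vertex -> list of (neighbor, non-negative weight)
    levels : number of ancestor levels checked before a task runs
             (0 means only the task's own distance is checked)
    Returns (dist, executed_tasks, pruned_tasks).
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    # A task is (vertex, distance, ancestors); ancestors holds up to `levels`
    # pairs (ancestor vertex, distance it had when the chain was created),
    # nearest ancestor first.
    worklist = deque([(source, 0, ())])
    executed = pruned = 0

    while worklist:
        v, d, ancestors = worklist.popleft()

        # Own-level check: a shorter distance to v was already written.
        stale = d > dist[v]
        # Multi-level check (read-only): some ancestor has since improved,
        # so a better task for v is already on its way.
        if not stale:
            stale = any(dist[a] < ad for a, ad in ancestors)
        if stale:
            pruned += 1
            continue

        executed += 1
        # Children inherit this task plus its ancestors, truncated to `levels`.
        chain = (((v, d),) + ancestors)[:levels]
        for u, w in graph[v]:
            nd = d + w
            if nd < dist[u]:
                dist[u] = nd
                worklist.append((u, nd, chain))

    return dist, executed, pruned


# Hypothetical example: the long edge s->a is explored before the short
# path s->b->a, so tasks spawned from the first visit of a are redundant.
graph = {
    "s": [("a", 10), ("b", 1)],
    "a": [("c", 1)],
    "b": [("a", 1)],
    "c": [("d", 1)],
    "d": [],
}
dist0, executed0, pruned0 = relaxed_sssp(graph, "s", levels=0)
dist1, executed1, pruned1 = relaxed_sssp(graph, "s", levels=1)
```

In this sketch the extra checks only read `dist`, which mirrors the abstract's observation that concurrent reads are cheaper than shared writes. On the example graph, both settings produce the same distances, while the ancestor check skips the task for `c` that was spawned from the outdated visit of `a`, along with the work that task would have created.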
