Towards a Deterministic Fine-Grained Task Ordering Using Multi-Versioned Memory

Task-based programming models aim to simplify parallel programming. A runtime system schedules tasks to execute on cores, and an essential function of this runtime is tracking and managing dependencies between tasks. A typical approach relies on programmers to annotate tasks and data structures, in effect manually specifying the inputs and outputs of each task. Dependencies are therefore associated with named program objects, which makes this approach problematic for pointer-based data structures. Furthermore, because the runtime system must track these dependencies, the read and write sets should be kept small for the runtime to perform efficiently.

We presume a memory system with architecturally visible support for multiple versions of data stored at the same program address. This paper proposes and evaluates a task-based execution model that uses this versioned memory system to deterministically parallelize sequential code. We have built a task-based runtime layer that uses this type of memory system for dependence tracking. We demonstrate the advantages of the proposed model by parallelizing pointer-heavy code, obtaining speedups of up to 19x on a 32-core system.
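
The idea of ordering tasks through versioned memory can be illustrated with a minimal software sketch. It is not the paper's hardware or runtime interface: the names `VersionedCell`, `load`, `store`, and `run_tasks` are hypothetical, and the blocking behaviour is emulated here with a condition variable. Each task is assumed to carry an ID that reflects its position in the original sequential order; it reads the version of a location produced by its predecessor and writes its own version, so the result does not depend on how the tasks are scheduled.

```python
import threading


class VersionedCell:
    """One program address holding several versions, keyed by task ID.

    Hypothetical software stand-in for a versioned memory location.
    """

    def __init__(self, initial):
        self._versions = {0: initial}      # version 0 is the initial value
        self._cond = threading.Condition()

    def load(self, version):
        # Block until the requested version has been produced.
        with self._cond:
            self._cond.wait_for(lambda: version in self._versions)
            return self._versions[version]

    def store(self, version, value):
        # Publish a new version; older versions remain readable.
        with self._cond:
            self._versions[version] = value
            self._cond.notify_all()


def run_tasks(n):
    """Parallelize the sequential loop `x = 0; for i in 1..n: x += i`."""
    cell = VersionedCell(0)

    def task(tid):
        # Task tid depends only on the version written by task tid - 1.
        cell.store(tid, cell.load(tid - 1) + tid)

    # Start the tasks in reverse order to show that launch order is irrelevant.
    threads = [threading.Thread(target=task, args=(tid,))
               for tid in range(n, 0, -1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load(n)


if __name__ == "__main__":
    print(run_tasks(10))
```

In this sketch the dependence between tasks is never declared: it follows from the version each task reads and writes, which is why no annotation of inputs and outputs on named objects is needed.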
