Paper information - Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism

Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., parallel I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) systems support speculative (transactional) and non-speculative (non-transactional) work, but lack coordination mechanisms between the two, and are limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of speculative execution, e.g., through escape actions and open-nested transactions. But these mechanisms are incompatible with systems that exploit ordered parallelism, which parallelize a broader range of applications and are easier to use. We contribute two techniques that enable seamlessly composing and coordinating speculative and non-speculative work in the context of ordered parallelism: (i) a task-based execution model that efficiently coordinates concurrent speculative and non-speculative ordered tasks, allowing them to create tasks of either kind and to operate on shared data; and (ii) a safe way for speculative tasks to invoke software-managed speculative actions that avoid hardware version management and conflict detection. These contributions improve efficiency and enable new capabilities. Across several benchmarks, they allow the system to dynamically choose whether to execute tasks speculatively or non-speculatively, avoid needless conflicts among speculative tasks, and allow speculative tasks to safely invoke irrevocable actions.
