SAM: Optimizing Multithreaded Cores for Speculative Parallelism

This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading.

We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently.

We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields a 1.85x speedup. SAM also reduces wasted work by 52%.
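The idea of coordinating dispatch with conflict resolution can be illustrated with a minimal sketch. It is not the paper's implementation: the `HwThread` record, the `conflict_priority` field, and the rule that a lower value wins conflicts (for example, an older timestamp) are assumptions made for illustration only, since the abstract does not say how priorities are assigned.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HwThread:
    """Hypothetical per-thread state seen by the dispatch stage."""
    tid: int
    # Assumption: a lower value wins conflicts (e.g. an older timestamp),
    # so the thread's speculative work is more likely to commit.
    conflict_priority: int
    ready: bool  # has an instruction that could dispatch this cycle


def pick_dispatch_thread(threads: Sequence[HwThread]) -> Optional[HwThread]:
    """Return the ready thread that would win a conflict, or None if none is ready.

    A speculation-oblivious policy would ignore conflict_priority here
    (e.g. round-robin); this sketch reuses it as the dispatch priority.
    """
    ready = [t for t in threads if t.ready]
    if not ready:
        return None
    # Ties are broken by thread id to keep the choice deterministic.
    return min(ready, key=lambda t: (t.conflict_priority, t.tid))


threads = [
    HwThread(tid=0, conflict_priority=40, ready=True),
    HwThread(tid=1, conflict_priority=12, ready=False),  # stalled this cycle
    HwThread(tid=2, conflict_priority=25, ready=True),
]
chosen = pick_dispatch_thread(threads)
```

In this example thread 1 has the best priority but is stalled, so the slot goes to thread 2, the ready thread most likely to commit under the assumed ordering.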
