Data-centric execution of speculative parallel programs

Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a speculative task is created, that denotes the data that the task is likely to access. We show it is easy to modify programs to convey locality through hints. We design simple hardware techniques that allow a state-of-the-art, tiled speculative architecture to exploit hints by: (i) running tasks likely to access the same data on the same tile, (ii) serializing tasks likely to conflict, and (iii) balancing tasks across tiles in a locality-aware fashion. We also show that programs can often be restructured to make hints more effective. Together, these techniques make speculative parallelism practical on large-scale systems: at 256 cores, hints achieve near-linear scalability on nine challenging applications, improving performance over hint-oblivious scheduling by 3.3× gmean and by up to 16×. Hints also make speculation far more efficient, reducing wasted work by 6.4× and traffic by 3.5× on average.
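As a rough software illustration of techniques (i) and (ii) above, the sketch below maps each task to a tile by hashing its integer hint, and lets at most one task per hint be runnable at a time. The abstract describes hardware mechanisms and does not give their details, so everything here is an assumption made for illustration: the names `tile_for_hint`, `Tile`, and `dispatch`, the tile count, and the multiplicative hash are all hypothetical. Locality-aware load balancing (iii) is not modeled.

```python
from collections import defaultdict, deque

NUM_TILES = 4  # hypothetical tile count, chosen only for this example


def tile_for_hint(hint: int, num_tiles: int = NUM_TILES) -> int:
    """Map a hint to a tile so that equal hints always land on the same tile.

    The multiplicative hash is an assumption; any deterministic hash works.
    """
    return ((hint * 2654435761) & 0xFFFFFFFF) % num_tiles


class Tile:
    """Holds one FIFO queue per hint for the tasks assigned to this tile."""

    def __init__(self):
        self.queues = defaultdict(deque)  # hint -> tasks in arrival order

    def enqueue(self, hint, task):
        self.queues[hint].append(task)

    def runnable(self):
        """Return the head task of each non-empty per-hint queue.

        Tasks sharing a hint are likely to conflict, so only the oldest
        one per hint is offered for execution (serialization).
        """
        return [q[0] for q in self.queues.values() if q]


def dispatch(tasks, num_tiles: int = NUM_TILES):
    """Send each (hint, task) pair to the tile selected by its hint."""
    tiles = [Tile() for _ in range(num_tiles)]
    for hint, task in tasks:
        tiles[tile_for_hint(hint, num_tiles)].enqueue(hint, task)
    return tiles


if __name__ == "__main__":
    tiles = dispatch([(7, "a"), (7, "b"), (9, "c")])
    for i, tile in enumerate(tiles):
        print(i, tile.runnable())
```

In this sketch, tasks "a" and "b" share hint 7, so they are queued on the same tile and "b" becomes runnable only after "a" is removed from the queue; a real system would also need to dequeue finished tasks and rebalance hints across tiles.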
