A high-level model for exploring multi-core architectures

Abstract: Understanding bottlenecks in parallel programs is critical to designing more efficient, higher-performing multi-core architectures. Synchronization is a prime example of a potential bottleneck, yet it is a necessary evil in parallel programs because access to shared data must be kept correct. Even expert programmers may find that synchronization adds significant overhead to their applications. Techniques to mitigate this overhead include speculative lock elision, faster hardware barriers, and load balancing via dynamic voltage and frequency scaling. A key insight is that the timing of synchronization events, which depends on the progress of other threads as well as the current one, is fundamental to an application's performance. To enable a better understanding of multithreaded applications, we introduce a new level of abstraction for multi-core evaluation and propose an analytical model centered on the timing and ordering of synchronization events. Our model allows researchers across the stack to evaluate application performance on future systems and architectures that do not yet exist. Compared to real hardware, our model estimates performance with a geometric mean error of 7.2% across thirteen benchmarks and can generate per-thread performance characteristics in less than a minute on average for very large (native) inputs.
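As a rough illustration of why the timing of synchronization events depends on every thread and not only the current one, the following minimal sketch replays barrier-separated phases and reports how long each thread waits. It is not the model proposed in the paper: the function name `simulate_barriers`, the input layout, and the example compute times are all assumptions made for illustration.

```python
from typing import List, Tuple


def simulate_barriers(work: List[List[float]]) -> Tuple[float, List[float]]:
    """Replay barrier-separated phases (illustrative only).

    work[t][p] is the hypothetical compute time of thread t in phase p.
    Returns the total runtime and the accumulated barrier wait per thread.
    """
    n_threads = len(work)
    n_phases = len(work[0])
    clock = 0.0  # time at which the previous barrier released all threads
    waiting = [0.0] * n_threads

    for p in range(n_phases):
        # Each thread reaches the barrier after finishing its own work.
        arrivals = [clock + work[t][p] for t in range(n_threads)]
        # The barrier releases only when the slowest thread arrives.
        release = max(arrivals)
        for t in range(n_threads):
            waiting[t] += release - arrivals[t]
        clock = release

    return clock, waiting


# Two threads, two phases; the times are made-up values.
total, waiting = simulate_barriers([[3.0, 1.0], [1.0, 4.0]])
print(total, waiting)
```

In this toy setup each thread's wait is determined entirely by the other thread's progress, which is the dependency the abstract highlights.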
