TaskPoint: Sampled simulation of task-based programs

Sampled simulation is a mature technique for reducing the simulation time of single-threaded programs, but it is not directly applicable to the simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Such models allow the programmer to specify program segments as tasks, which are instantiated many times and scheduled dynamically onto available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in each thread processing a different instruction stream. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We use task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals, we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1, at an average error of 1.8% and a maximum error of 15.0%.
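The idea of using task instances as sampling units can be illustrated with a minimal sketch. It is not the paper's implementation: the function `sampled_simulation`, the callback `detailed_cycles`, and the parameter `samples_per_type` are hypothetical names introduced here, and the fast-forward step is assumed to be a simple mean of previously observed instances of the same task type. The sketch estimates the total cycles summed over all task instances and ignores scheduling and thread interleaving.

```python
from collections import defaultdict
from statistics import mean


def sampled_simulation(tasks, detailed_cycles, samples_per_type=2):
    """Estimate total task cycles while simulating only some instances in detail.

    tasks: iterable of (task_type, instance_id) pairs in execution order.
    detailed_cycles: hypothetical callback standing in for detailed simulation;
        it returns the cycle count of one task instance.
    samples_per_type: assumed number of instances per task type to simulate
        in detail before switching to fast-forwarding.
    Returns (estimated_total_cycles, instances_simulated_in_detail).
    """
    observed = defaultdict(list)  # task type -> cycle counts measured in detail
    total = 0.0
    detailed = 0
    for task_type, instance_id in tasks:
        history = observed[task_type]
        if len(history) < samples_per_type:
            # Detailed interval: measure this instance and record the result.
            cycles = detailed_cycles(task_type, instance_id)
            history.append(cycles)
            detailed += 1
        else:
            # Fast-forward (assumption): reuse the mean of the sampled instances.
            cycles = mean(history)
        total += cycles
    return total, detailed


if __name__ == "__main__":
    # Toy workload: every instance of a type costs the same number of cycles.
    cost = {"stencil": 100, "reduce": 40}
    workload = [("stencil", i) for i in range(4)] + [("reduce", i) for i in range(3)]
    print(sampled_simulation(workload, lambda t, i: cost[t]))
```

In this toy workload only four of the seven instances are measured in detail. The estimate is exact here only because all instances of a type have identical cost; with varying instance durations the mean-based fast-forward introduces error.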
