Annotation guided collection of context-sensitive parallel execution profiles

Studying the relative behavior of an application's threads is critical to identifying performance bottlenecks and understanding their root causes. We present context-sensitive parallel (CSP) execution profiles, which capture the relative behavior of threads in terms of the user-selected code regions they execute. CSPs can be analyzed to compute the execution time the application spends in behavior states of interest. To capture execution context, code regions of interest can be given static and dynamic names using a versatile set of annotations. The CSP divides the execution time of a multithreaded application into a sequence of time intervals called frames, during which no thread transitions between code regions. By appropriately selecting and naming code regions, the user can obtain a CSP that captures all occurrences of the desired behavior states. We provide the user with a powerful query language to facilitate the analysis of CSPs. Our implementation for collecting CSPs of C++ programs has low overhead and high accuracy. Collecting CSPs for full executions of 12 Parsec programs incurred an execution-time overhead of at most 7%. The accuracy of CSPs was validated in the context of common performance problems such as load imbalance in pipeline stages and the presence of straggler threads.
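The notion of a frame can be illustrated with a minimal Python sketch. It is not the authors' implementation or query language: the event format `(timestamp, thread_id, region_name)`, the function names `build_frames` and `time_in_state`, and the `"idle"` placeholder region are all assumptions made for illustration. The sketch cuts the timeline at every region transition, so that no thread changes region inside a frame, and then sums the duration of the frames that match a user-supplied predicate.

```python
def build_frames(events, threads, start_time, end_time, idle="idle"):
    """Split [start_time, end_time) into frames with no region transitions.

    events: iterable of (timestamp, thread_id, region_name) tuples, each
            meaning that the thread enters the region at that timestamp
            (hypothetical format, assumed for this sketch).
    Returns a list of (frame_start, frame_end, {thread_id: region_name}).
    """
    state = {t: idle for t in threads}   # region each thread is currently in
    frames = []
    frame_start = start_time
    for ts, tid, region in sorted(events, key=lambda e: e[0]):
        # A transition at a later time closes the current frame.
        # Transitions with the same timestamp share one frame boundary.
        if ts > frame_start:
            frames.append((frame_start, ts, dict(state)))
            frame_start = ts
        state[tid] = region
    if end_time > frame_start:
        frames.append((frame_start, end_time, dict(state)))
    return frames


def time_in_state(frames, predicate):
    """Total duration of the frames whose thread-to-region map satisfies predicate."""
    return sum(end - start for start, end, state in frames if predicate(state))


# Hypothetical two-thread run: T1 reaches the barrier at t=4, T0 at t=6.
example_events = [
    (0, "T0", "compute"),
    (0, "T1", "compute"),
    (4, "T1", "barrier"),
    (6, "T0", "barrier"),
]
example_frames = build_frames(example_events, ["T0", "T1"], start_time=0, end_time=7)

# Time during which exactly one thread is still computing (a straggler).
straggler_time = time_in_state(
    example_frames,
    lambda s: sum(1 for r in s.values() if r == "compute") == 1,
)
```

In this example the run splits into three frames, and the only frame with a single computing thread spans t=4 to t=6, so `straggler_time` is 2. A behavior state such as load imbalance would be expressed the same way, as a predicate over the per-thread region names in a frame.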

Xiangyu Zhang, et al. Annotation Guided Collection of Context-Sensitive Parallel Execution Profiles, 2017, RV.
