Pinpointing performance inefficiencies via lightweight variance profiling

Execution variance among different invocations of the same procedure often indicates performance losses. On the one hand, instrumentation-based tools can insert calipers around procedures and identify execution variance; however, they can introduce high overheads. On the other hand, sampling-based tools insert no instrumentation and have low overheads; however, they cannot synchronize samples with procedure entry and exit. In this paper, we propose FVSampler, a lightweight, sampling-based variance profiler. FVSampler employs hardware performance monitoring units in conjunction with hardware debug registers to sample and monitor whole procedure instances (from invocation until return) and to collect hardware metrics in each sampled procedure instance. FVSampler typically incurs only 6% runtime overhead and negligible memory overhead, making it suitable for HPC-scale production codes. We evaluate FVSampler with several parallel applications and demonstrate its effectiveness in pinpointing execution variance. Guided by FVSampler, we tune data structures and algorithms to obtain significant speedups.
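
The abstract does not say how per-procedure variance is summarized, so the following is only a minimal sketch under stated assumptions: each sampled procedure instance is assumed to yield a (procedure name, metric value) pair, for example a cycle count, and variance is accumulated with a one-pass running update so that no per-instance history needs to be stored. The names `RunningStats`, `aggregate`, and `rank_by_variation` are hypothetical and do not come from FVSampler.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass mean/variance accumulator (Welford-style update)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two samples exist.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def coeff_of_variation(self) -> float:
        # Standard deviation relative to the mean, so that procedures with
        # different absolute costs can be compared.
        return math.sqrt(self.variance) / self.mean if self.mean else 0.0


def aggregate(samples):
    """Fold (procedure, metric) pairs into per-procedure running statistics."""
    stats = defaultdict(RunningStats)
    for procedure, metric in samples:
        stats[procedure].update(metric)
    return dict(stats)


def rank_by_variation(stats):
    """Return procedure names ordered from most to least variable."""
    return sorted(stats, key=lambda p: stats[p].coeff_of_variation, reverse=True)


# Hypothetical sampled instances: (procedure name, cycles in that instance).
samples = [("solve", 100), ("solve", 100), ("solve", 400),
           ("pack", 50), ("pack", 50)]
stats = aggregate(samples)
ranking = rank_by_variation(stats)
```

In this toy input, `solve` has one instance that costs four times as much as the others, so it ranks ahead of `pack`, whose instances are identical. Ranking by coefficient of variation rather than raw variance is a design choice of this sketch, not something the abstract states.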
