wPerf: Generic Off-CPU Analysis to Identify Bottleneck Waiting Events

This paper tries to identify waiting events that limit the maximal throughput of a multi-threaded application. To achieve this goal, we need to understand not only an event's impact on the threads waiting for it (i.e., its local impact), but also whether that impact can reach other threads involved in request processing (i.e., its global impact). To address these challenges, wPerf computes the local impact of a waiting event with a technique called cascaded re-distribution; more importantly, wPerf builds a wait-for graph to compute whether such impact can indirectly reach other threads. By combining these two techniques, wPerf tries to identify the events with a large impact on all threads. We apply wPerf to a number of open-source multi-threaded applications. Following wPerf's guidance, we are able to improve their throughput by up to 4.83×. The overhead of recording waiting events at runtime is about 5.1% on average.
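To make the notion of global impact more concrete, the following is a minimal sketch of a wait-for graph and a reachability query over it. It is an illustration only, not wPerf's actual algorithm: the event format `(waiter, waited_on, seconds)`, the function names `build_wait_for_graph` and `threads_affected_by`, and the thread names are all assumptions made for this example. An edge from A to B means that A spent time waiting for B, so a delay in B can reach every thread that waits on B directly or indirectly.

```python
from collections import defaultdict, deque


def build_wait_for_graph(wait_events):
    """Aggregate (waiter, waited_on, seconds) records into a weighted graph.

    graph[waiter][waited_on] holds the total time `waiter` spent
    waiting for `waited_on`. The record format is hypothetical.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for waiter, waited_on, seconds in wait_events:
        graph[waiter][waited_on] += seconds
    return graph


def threads_affected_by(graph, source):
    """Return every thread that waits on `source` directly or indirectly."""
    # Reverse the edges so we can walk from a thread to its waiters.
    reverse = defaultdict(set)
    for waiter, targets in graph.items():
        for waited_on in targets:
            reverse[waited_on].add(waiter)

    affected = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for waiter in reverse[node]:
            if waiter != source and waiter not in affected:
                affected.add(waiter)
                queue.append(waiter)
    return affected


# Hypothetical trace: two workers wait on a logger thread, the logger
# waits on the disk, and a frontend thread waits on worker1.
events = [
    ("worker1", "logger", 2.0),
    ("worker2", "logger", 1.5),
    ("frontend", "worker1", 3.0),
    ("logger", "disk", 4.0),
]
graph = build_wait_for_graph(events)
disk_impact = threads_affected_by(graph, "disk")
frontend_impact = threads_affected_by(graph, "frontend")
```

In this toy trace, a delay at the disk reaches all four threads through the logger, while nothing waits on the frontend, so its waiting has no downstream reach. The sketch covers only reachability; it does not attempt the cascaded re-distribution step that the abstract names for computing local impact.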
