Coz: finding code that counts with causal profiling

Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spend their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for developers to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of a potential optimization by virtually speeding up a selected line of code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications such as Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.
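The equivalence behind virtual speedup can be illustrated with a toy model. The sketch below is not Coz's implementation; it assumes a hypothetical program with two threads that run in parallel and join at the end, where thread A executes one line n times at cost t each. All function and parameter names are made up for illustration. It also assumes that the profiler subtracts the total inserted pause time from the measured runtime, which is how the slowdown of everything else is turned into an estimate of the speedup.

```python
def baseline_runtime(a_other, n, t, b):
    """Runtime of the toy program: both threads run in parallel, then join."""
    return max(a_other + n * t, b)


def actual_speedup_runtime(a_other, n, t, b, d):
    """Runtime if the line really ran faster by a fraction d (0 <= d <= 1)."""
    thread_a = a_other + n * t * (1.0 - d)
    return max(thread_a, b)


def virtual_speedup_runtime(a_other, n, t, b, d):
    """Estimated runtime from a virtual speedup of the same line.

    The line itself is left untouched. Each time it runs, the other
    thread is paused for d * t. The total pause time is then subtracted
    from the measured runtime (an assumption of this toy model).
    """
    inserted = n * t * d           # total pause time added to thread B
    thread_a = a_other + n * t     # unchanged: the line is not modified
    thread_b = b + inserted        # slowed down by the inserted pauses
    measured = max(thread_a, thread_b)
    return measured - inserted


if __name__ == "__main__":
    # Hypothetical workload: thread A spends 20 ms elsewhere plus
    # 100 executions of the line at 1 ms each; thread B takes 90 ms.
    for d in (0.0, 0.25, 0.5, 1.0):
        real = actual_speedup_runtime(20, 100, 1.0, 90, d)
        virt = virtual_speedup_runtime(20, 100, 1.0, 90, d)
        print(f"speedup {d:.0%}: actual {real:.1f} ms, virtual {virt:.1f} ms")
```

In this model the two estimates agree for every value of d, and the benefit stops growing once thread B becomes the bottleneck. That is the kind of result a conventional profiler, which only reports that thread A spends most of its time on the line, would not reveal.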
