System execution path profiling using hardware performance counters

The critical execution path of a task, obtained from a kernel trace, reports the time spent waiting for each task involved in a heterogeneous and distributed application. However, additional profiling is needed to identify the problematic code associated with long-lasting path edges. Hardware counter sampling provides insight into software performance at the microarchitecture level, for instance by extracting the call stack every 100,000 execution cycles to understand where the execution time is spent. Similarly, extracting the call stack at the end of a long waiting system call is often useful. This technique is readily available for both statically compiled and JIT-compiled code. Interpreted code, however, is executed only indirectly on the processor, so the link between the source statements and the executed machine instructions is missing. We propose an architecture to efficiently record call stacks along the execution path, including those of interpreted programs, in a minimally intrusive way that maintains the abstraction boundary between the kernel, the interpreter, and the user code. The method consists of sending a signal from within the performance counter interrupt handler. The user-space code receiving the signal can then inspect and record the state of the program. We implemented a profiler for the CPython interpreter using this technique. We studied the benefit, the accuracy, and the cost of the proposed technique compared to an all-kernel monitoring solution.
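The user-space half of this mechanism can be illustrated with a minimal sketch. It is not the implementation described above: it assumes a Unix system and uses the standard `ITIMER_PROF` interval timer, which delivers `SIGPROF`, as a stand-in for the signal that the proposed method sends from the performance counter interrupt handler. The names `record_stack`, `busy_work`, `profile`, and `samples` are hypothetical and chosen only for illustration.

```python
import collections
import signal

# Aggregated samples: maps a call stack (outermost frame first) to a hit count.
samples = collections.Counter()


def record_stack(signum, frame):
    """Signal handler: walk the interpreter frames and record the call stack.

    `frame` is the Python frame that was executing when the handler ran, so
    the recorded stack refers to interpreted statements, not to the native
    stack of the interpreter itself.
    """
    stack = []
    while frame is not None:
        code = frame.f_code
        stack.append((code.co_name, code.co_filename))
        frame = frame.f_back
    samples[tuple(reversed(stack))] += 1


def busy_work(n):
    """CPU-bound workload used as the profiling target."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile(func, *args, interval=0.001):
    """Run func(*args) while sampling its Python call stack.

    Assumption: the CPU-time interval timer stands in for a hardware
    counter overflow; the kernel side of the proposed method is not modeled.
    """
    signal.signal(signal.SIGPROF, record_stack)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        return func(*args)
    finally:
        # Stop sampling; the handler stays installed so that a signal
        # already in flight is still handled harmlessly.
        signal.setitimer(signal.ITIMER_PROF, 0, 0)


if __name__ == "__main__":
    profile(busy_work, 3_000_000)
    for stack, count in samples.most_common(3):
        print(count, " > ".join(name for name, _ in stack))
```

In this sketch the kernel only delivers a signal, and the interpreter-level state is inspected entirely in user space, which mirrors the separation between kernel, interpreter, and user code that the method aims to preserve.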
