Non-determinism and overcount on modern hardware performance counter implementations

Ideal hardware performance counters provide exact, deterministic results. Real-world performance monitoring unit (PMU) implementations do not always live up to this ideal. Events that should be exact and deterministic (such as retired instructions) show run-to-run variation and overcount on x86_64 machines, even when run in strictly controlled environments. These effects are non-intuitive to casual users and cause difficulties when strict determinism is desirable, such as when implementing deterministic replay or deterministic threading libraries. We investigate eleven different x86_64 CPU implementations and identify the sources of divergence from expected count totals. Of all the counter events investigated, we find only a few that exhibit enough determinism to be used without adjustment in deterministic execution environments. We also briefly investigate ARM, IA64, POWER and SPARC systems and find that the counter events on these platforms are more deterministic. We explore various methods of working around the limitations of the x86_64 events, but in many cases no workaround is possible, and a fix would require architectural redesign of the underlying PMU.
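
The two effects named above, run-to-run variation and overcount, can be separated with simple arithmetic once repeated counts of one event are in hand. The following is only an illustrative sketch, not the authors' methodology: the function name `summarize_counts`, the sample counts, and the expected total of 1,000,000 retired instructions are all hypothetical, and the code assumes the expected count is known in advance (for example, from a hand-counted assembly benchmark).

```python
from statistics import mean


def summarize_counts(measured, expected):
    """Summarize repeated counter readings for a single event.

    measured: list of integer counts, one per run (hypothetical data).
    expected: the count an ideal, exact counter would report.
    """
    # Run-to-run variation: difference between the largest and smallest reading.
    spread = max(measured) - min(measured)
    # Overcount: how far each reading lies above the expected total.
    overcounts = [m - expected for m in measured]
    return {
        "deterministic": spread == 0,             # identical on every run
        "exact": all(o == 0 for o in overcounts),  # matches the expected total
        "spread": spread,
        "mean_overcount": mean(overcounts),
    }


# Hypothetical readings from five runs of the same benchmark.
runs = [1_000_004, 1_000_004, 1_000_007, 1_000_004, 1_000_005]
report = summarize_counts(runs, expected=1_000_000)
print(report)
```

An event can be deterministic without being exact (every run overcounts by the same amount), which is why the sketch reports the two properties separately.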
