Instant profiling: Instrumentation sampling for profiling datacenter applications

Profile-guided optimization has great potential to reduce datacenter costs. Hardware performance monitoring units enable profiling with negligible overhead, and they have proven effective at helping programmers find code regions to optimize by continuously monitoring datacenter applications on live traffic. However, these hardware features are inflexible and often buggy, which limits the types of data that can be gathered. Instrumentation-based profiling can complement or replace hardware functionality by providing more flexible and targeted information gathering. Unfortunately, the overhead of existing instrumentation mechanisms prevents their use in production runs. To be usable in datacenters, a profiling mechanism must impose overheads of less than a few percent, in terms of both throughput and latency, while still generating meaningful profile data. This paper presents instant profiling, an instrumentation sampling technique that uses dynamic binary translation. Instead of instrumenting the entire execution, instant profiling periodically interleaves native execution and instrumented execution according to configurable profiling duration and frequency parameters. It further reduces the latency degradation of initial profiling phases by pre-populating a software code cache. We evaluate the performance and effectiveness of this new profiling technique on the SPEC CINT2006 benchmark suite and two datacenter application benchmarks. The results show that it is well suited for deployment in datacenters, incurring less than 6% slowdown and 3% computational overhead on average.
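
The interleaving of native and instrumented execution described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the paper's implementation: the names `SamplingConfig`, `period_ms`, `duration_ms`, and `instr_slowdown` are hypothetical, and the model assumes that each period begins with an instrumented window of fixed length and that instrumented code runs slower than native code by a constant factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical parameters for one sampling schedule (not from the paper)."""
    period_ms: int    # length of one full cycle; profiling frequency = 1 / period
    duration_ms: int  # instrumented window at the start of each cycle

    def __post_init__(self) -> None:
        if not 0 < self.duration_ms <= self.period_ms:
            raise ValueError("require 0 < duration_ms <= period_ms")


def is_instrumented(t_ms: int, cfg: SamplingConfig) -> bool:
    """Return True if wall-clock time t_ms falls inside an instrumented window."""
    return (t_ms % cfg.period_ms) < cfg.duration_ms


def instrumented_fraction(cfg: SamplingConfig) -> float:
    """Fraction of wall-clock time spent running instrumented code."""
    return cfg.duration_ms / cfg.period_ms


def relative_throughput(cfg: SamplingConfig, instr_slowdown: float) -> float:
    """Work completed per unit time, relative to fully native execution.

    Assumes instrumented code is slower than native code by the constant
    factor instr_slowdown (a modeling assumption, not a measured value).
    """
    f = instrumented_fraction(cfg)
    return (1.0 - f) + f / instr_slowdown


if __name__ == "__main__":
    cfg = SamplingConfig(period_ms=1000, duration_ms=10)
    print(is_instrumented(5, cfg), is_instrumented(500, cfg))
    print(instrumented_fraction(cfg))
    print(relative_throughput(cfg, instr_slowdown=5.0))
```

In this model, shortening the instrumented window or lengthening the period lowers the instrumented fraction and therefore the throughput cost, at the price of collecting fewer samples. The model ignores the one-time cost of building the code cache, which the paper addresses separately by pre-populating it.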
