Rapid profiling via stratified sampling

Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. We propose a profiling scheme that achieves these goals. Conceptually, the hardware compresses a stream of profile data by counting identical events; the compressed profile data is passed to software for analysis. Compressing the high-bandwidth event stream greatly reduces software overhead. Because optimizations can tolerate some profiling errors, we allow the stream compressor to be lossy, thereby enabling a low-cost sampling-based hardware design. Because the hardware compressor is insensitive to the event content, it supports various profile types and can process multiple types simultaneously. The basic components of our framework are periodic and random samplers, counters, and hash functions. These components are composed to form a variety of stream compressors. One design is both simple and very effective: the input stream is hash-split into multiple substreams, each of which is fed into a simple periodic sampler that selects every kth event. This stratified periodic sampler performs better than conventional random sampling because it biases each substream towards a small number of unique events, thereby reducing sampling error and allowing faster convergence to an accurate profile. For example, convergence to a given level of accuracy is about twice as fast for gcc. When sampling overhead is considered, the stratified periodic profiler achieves less than 3% error while incurring an overhead of only 3.5% for gcc.
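
The stratified periodic sampler described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the class and function names, the use of Python's built-in `hash` as the splitting function, the per-substream counter that resets after each sample, and the choice to weight each sample by k when estimating counts are all hypothetical choices made for illustration.

```python
from collections import Counter


class StratifiedPeriodicSampler:
    """Software model (hypothetical) of a hash-split periodic sampler."""

    def __init__(self, num_substreams, k, hash_fn=hash):
        self.num_substreams = num_substreams
        self.k = k                      # sample every kth event per substream
        self.hash_fn = hash_fn          # assumed splitting function
        self.counters = [0] * num_substreams

    def observe(self, event):
        """Return the event if it is selected as a sample, else None."""
        idx = self.hash_fn(event) % self.num_substreams
        self.counters[idx] += 1
        if self.counters[idx] == self.k:
            self.counters[idx] = 0      # restart the period for this substream
            return event
        return None


def profile(events, num_substreams, k):
    """Estimate event frequencies from samples (assumption: weight each by k)."""
    sampler = StratifiedPeriodicSampler(num_substreams, k)
    estimate = Counter()
    for event in events:
        sample = sampler.observe(event)
        if sample is not None:
            estimate[sample] += k
    return estimate


if __name__ == "__main__":
    # Two distinct events that hash to different substreams.
    stream = [0] * 9 + [1] * 6
    print(profile(stream, num_substreams=4, k=3))
```

In this model, events with the same hash always land in the same substream, so a substream dominated by a few unique events is sampled almost deterministically. That is the intuition the abstract gives for the reduced sampling error relative to random sampling.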
