Hardware-Based Profiling: An Effective Technique for Profile-Driven Optimization

Profile-based optimizations can be used for instruction scheduling, loop scheduling, data preloading, function inlining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%–4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.
