AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications

AutoFDO is a system to simplify real-world deployment of feedback-directed optimization (FDO). The system works by sampling hardware performance monitors on production machines and using those profiles to guide optimization. Profile data is stale by design, and we have implemented compiler features to deliver stable speedup across releases. The resulting performance has a geometric mean improvement of 10.5%. The system is deployed to hundreds of binaries at Google, and it is extremely easy to enable; users need only add some flags to their release build. To date, AutoFDO has increased the number of FDO users at Google by 8x and has doubled the number of cycles spent in FDO-optimized binaries. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries.
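As a brief illustration of how a geometric mean improvement figure is typically computed, the sketch below averages per-benchmark speedup ratios in log space. The function name and the sample ratios are hypothetical and chosen only for illustration; they are not measurements from this work, and the paper's exact methodology may differ.

```python
import math


def geometric_mean_improvement(speedups):
    """Return the geometric mean improvement as a fraction (0.10 means 10%).

    `speedups` holds one ratio per benchmark, defined here (an assumption)
    as baseline_time / optimized_time, so values above 1.0 are speedups.
    """
    if not speedups:
        raise ValueError("need at least one speedup ratio")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedup ratios must be positive")
    # Average the logs, then exponentiate: equivalent to the n-th root
    # of the product, but numerically safer for long lists.
    mean_log = sum(math.log(s) for s in speedups) / len(speedups)
    return math.exp(mean_log) - 1.0


if __name__ == "__main__":
    # Hypothetical per-benchmark ratios, for illustration only.
    sample = [1.05, 1.10, 1.20]
    print(f"geometric mean improvement: {geometric_mean_improvement(sample):.1%}")
```

The geometric mean is the usual choice for summarizing ratios because it is not dominated by a single outlier benchmark the way an arithmetic mean of ratios can be.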