AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers

The large instruction working sets of private and public cloud workloads lead to frequent instruction cache misses and costs in the millions of dollars. While prior work has identified the growing importance of this problem, to date, there has been little analysis of where the misses come from, and what the opportunities are to improve them. To address this challenge, this paper makes three contributions. First, we present the design and deployment of a new, always-on, fleet-wide monitoring system, AsmDB, that tracks front-end bottlenecks. AsmDB uses hardware support to collect bursty execution traces, fleet-wide temporal and spatial sampling, and sophisticated offline post-processing to construct full-program dynamic control-flow graphs. Second, based on a longitudinal analysis of AsmDB data from real-world online services, we present two detailed insights on the sources of front-end stalls: (1) cold code that is brought in along with hot code leads to significant cache fragmentation and a corresponding large number of instruction cache misses; (2) distant branches and calls that are not amenable to traditional cache locality or next-line prefetching strategies account for a large fraction of cache misses. Third, we prototype two optimizations that target these insights. For misses caused by fragmentation, we focus on memcmp, one of the hottest functions contributing to cache misses, and show how fine-grained layout optimizations lead to significant benefits. For misses at the targets of distant jumps, we propose new hardware support for software code prefetching and prototype a new feedback-directed compiler optimization that combines static program flow analysis with dynamic miss profiles to demonstrate significant benefits for several large warehouse-scale workloads. Improving upon prior work, our proposal avoids invasive hardware modifications by prefetching via software in an efficient and scalable way. Simulation results show that such an approach can eliminate up to 96% of instruction cache misses with negligible overheads.

[1]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[2]  Parthasarathy Ranganathan,et al.  The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition , 2018, The Datacenter as a Computer.

[3]  Muthu Dayalan,et al.  MapReduce : Simplified Data Processing on Large Cluster , 2018 .

[4]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[5]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[6]  Babak Falsafi,et al.  Confluence: Unified instruction supply for scale-out servers , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[7]  Cheng-Chieh Huang,et al.  Boomerang: A Metadata-Free Architecture for Control Flow Delivery , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[8]  Thomas F. Wenisch,et al.  Temporal instruction fetch streaming , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[9]  Boris Grot,et al.  Blasting through the Front-End Bottleneck with Shotgun , 2018, ASPLOS.

[10]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[11]  Todd C. Mowry,et al.  Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[12]  Guilherme Ottoni,et al.  BOLT: A Practical Binary Optimizer for Data Centers and Beyond , 2018, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[13]  T. K. Prakash,et al.  Performance Characterization of SPEC CPU 2006 Benchmarks on Intel Core 2 Duo Processor , .

[14]  Gu-Yeon Wei,et al.  Profiling a warehouse-scale computer , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[15]  David A. Patterson,et al.  Technical perspective: the data center is the computer , 2008, CACM.

[16]  Derek Bruening,et al.  Efficient, transparent, and comprehensive runtime code manipulation , 2004 .

[17]  Tipp Moseley,et al.  AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[18]  Robert Hundt,et al.  Loop Recognition in C++/Java/Go/Scala , 2011 .

[19]  Thomas F. Wenisch,et al.  RDIP: Return-address-stack Directed Instruction Prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Glenn Reinman,et al.  Fetch directed instruction prefetching , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[21]  David A. Patterson,et al.  Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .

[22]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[23]  Ahmad Yasin,et al.  A Top-Down method for performance analysis and counters architecture , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[24]  Jeffrey Dean,et al.  Transparent, low-overhead profiling on modern processors , 1998 .

[25]  Babak Falsafi,et al.  SHIFT: Shared history instruction fetch for lean-core server processors , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Robert S. Cohn,et al.  Hot cold optimization of large Windows/NT applications , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[27]  Lori L. Pollock,et al.  A Region-based Partial Inlining Algorithm for an ILP Optimizing Compiler , 2002, PDPTA.

[28]  Daniel Sánchez,et al.  Tailbench: a benchmark suite and evaluation methodology for latency-critical applications , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[29]  Gang Ren,et al.  Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers , 2010, IEEE Micro.

[30]  Paul Havlak,et al.  Nesting of reducible and irreducible loops , 1997, TOPL.

[31]  AilamakiAnastasia,et al.  Clearing the clouds , 2012 .

[32]  Babak Falsafi,et al.  Proactive instruction fetch , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33]  Stéphane Eranian What can performance counters do for memory subsystem analysis? , 2008, MSPC '08.

[34]  Martin Fleury,et al.  Software-Controlled Instruction Prefetch Buffering for Low-End Processors , 2015, J. Circuits Syst. Comput..

[35]  Chunjie Luo,et al.  Characterizing data analysis workloads in data centers , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[36]  Christoforos E. Kozyrakis,et al.  Memory Hierarchy for Web Search , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).