Memory Hierarchy for Web Search

Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.

[1]  John Paul Shen,et al.  Hardware Support for Prescient Instruction Prefetch , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[2]  Cheng-Chieh Huang,et al.  ATCache: Reducing DRAM cache latency via a small SRAM tag cache , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[3]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[4]  Babak Falsafi,et al.  Fat Caches for Scale-Out Servers , 2017, IEEE Micro.

[5]  Gang Ren,et al.  Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers , 2010, IEEE Micro.

[6]  David Wentzlaff,et al.  Processor: A 64-Core SoC with Mesh Interconnect , 2010 .

[7]  Kushagra Vaid,et al.  Web search using mobile cores: quantifying and mitigating the price of efficiency , 2010, ISCA.

[8]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[9]  Jeffrey Dean,et al.  Challenges in building large-scale information retrieval systems: invited talk , 2009, WSDM '09.

[10]  Todd C. Mowry,et al.  Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[11]  Christoforos E. Kozyrakis,et al.  Towards energy proportionality for large-scale latency-critical workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[12]  Gabriel H. Loh,et al.  A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[13]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[14]  Babak Falsafi,et al.  Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.

[15]  Babak Falsafi,et al.  Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[16]  Pradeep Dubey,et al.  Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs , 2009, Proc. VLDB Endow..

[17]  Sriram Sankar,et al.  Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.

[18]  Babak Falsafi,et al.  Scale-out NUMA , 2014, ASPLOS.

[19]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[20]  William J. Bowhill,et al.  The Xeon® Processor E5-2600 v3: a 22 nm 18-Core Product Family , 2016, IEEE Journal of Solid-State Circuits.

[21]  Yen-Chen Liu,et al.  Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.

[22]  Kevin Zhang,et al.  2nd generation embedded DRAM with 4X lower self refresh power in 22nm Tri-Gate CMOS technology , 2014, 2014 Symposium on VLSI Circuits Digest of Technical Papers.

[23]  Babak Falsafi,et al.  Scale-out processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[24]  Babak Falsafi,et al.  Proactive instruction fetch , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Jung Ho Ahn,et al.  A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies , 2008, 2008 International Symposium on Computer Architecture.

[26]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[27]  Mark D. Hill,et al.  Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition , 2013, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition.

[29]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[30]  Ahmad Yasin,et al.  A Top-Down method for performance analysis and counters architecture , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[31]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[32]  Allison Woodruff,et al.  An Investigation of Documents from the World Wide Web , 1996, Comput. Networks.

[33]  Hari Angepat,et al.  A cloud-scale acceleration architecture , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34]  Wilfred Gomes,et al.  5.9 Haswell: A family of IA 22nm processors , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[35]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[36]  Babak Falsafi,et al.  SHIFT: Shared history instruction fetch for lean-core server processors , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[37]  Francisco J. Cazorla,et al.  Making data prefetch smarter: Adaptive prefetching on POWER7 , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[38]  Carole-Jean Wu,et al.  PACMan: Prefetch-Aware Cache Management for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[39]  Josep Torrellas,et al.  The memory performance of DSS commercial workloads in shared-memory multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[40]  Onur Mutlu,et al.  Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[41]  Gabriel H. Loh,et al.  Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[42]  Bruce Jacob,et al.  Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[43]  Christoforos E. Kozyrakis,et al.  Towards energy-proportional datacenter memory with mobile DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[44]  Trevor N. Mudge,et al.  Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008, 2008 International Symposium on Computer Architecture.

[45]  Luiz André Barroso,et al.  Web Search for a Planet: The Google Cluster Architecture , 2003, IEEE Micro.

[46]  Thomas F. Wenisch,et al.  RDIP: Return-address-stack Directed Instruction Prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[47]  Stéphan Jourdan,et al.  Haswell: The Fourth-Generation Intel Core Processor , 2014, IEEE Micro.

[48]  David Daly,et al.  The cache and memory subsystems of the IBM POWER8 processor , 2015, IBM J. Res. Dev..

[49]  Thomas F. Wenisch,et al.  Power management of online data-intensive services , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[50]  Gu-Yeon Wei,et al.  Profiling a Warehouse-Scale Computer , 2016, IEEE Micro.

[51]  Ronald G. Dreslinski,et al.  Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers , 2015, ASPLOS.