The Implications of Page Size Management on Graph Analytics

Graph representations of data are ubiquitous in analytic applications. However, graph workloads are notorious for irregular memory access patterns with variable access frequency per address, which cause high translation lookaside buffer (TLB) miss rates and significant address translation overheads during workload execution. Furthermore, these access patterns sparsely span a large address space, yielding memory footprints that exceed total TLB coverage by orders of magnitude. It is widely recognized that employing huge pages can alleviate some of these bottlenecks. However, in real systems, huge pages are not always available and the OS often provisions huge pages suboptimally, significantly reducing peak application performance. State-of-the-art huge page management techniques employ heuristics, such as huge page region utilization, to guide page size decisions. However, these heuristics are often optimal only for specific memory access patterns or footprint sizes, and do not sufficiently adapt to dynamic workload characteristics. This work performs a comprehensive characterization of the effects of page size allocation policy and page placement on graph application throughput. We show that when system memory is nearly full or fragmented (the common case in real systems), the huge page resources available to an application are limited and their utility must be maximized. We demonstrate that (1) awareness of single-use memory can avoid spending precious huge page resources on data that receives little benefit and (2) coupling degree-aware preprocessing of graph data with programmer-guided use of huge pages boosts performance by 1.26–1.57× over using 4 KB pages alone, while achieving 77.3–96.3% of the performance of unbounded huge page usage and requiring only 0.58–2.92% of the memory resources. This manual, domain-specific optimization of huge page efficiency in memory-constrained systems demonstrates that huge pages are a new class of resource that must be intelligently managed by programmers or next-generation OS policies to optimize application performance.
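
The second finding, pairing degree-aware preprocessing with programmer-guided huge pages, can be illustrated with a small sketch. It is not the paper's implementation, and the abstract does not specify a mechanism. The sketch assumes that "degree-aware" means renumbering vertices by descending degree so that the most frequently accessed per-vertex data forms a contiguous prefix, and that "programmer-guided" means a Linux `madvise(MADV_HUGEPAGE)` hint on that prefix. The function names, the 2 MB page size, and the page budget are all hypothetical.

```python
import mmap

HUGE_PAGE_BYTES = 2 * 1024 * 1024  # assumption: 2 MB huge pages (x86-64)


def degree_order(edges, num_vertices):
    """Return new_id[old_id], numbering vertices by descending degree."""
    degree = [0] * num_vertices
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    # sorted() is stable, so equal-degree vertices keep their original order.
    by_degree = sorted(range(num_vertices), key=lambda v: -degree[v])
    new_id = [0] * num_vertices
    for rank, old in enumerate(by_degree):
        new_id[old] = rank
    return new_id


def hot_prefix_bytes(num_vertices, bytes_per_vertex, budget_pages):
    """Bytes at the start of the property array to cover with huge pages."""
    total = num_vertices * bytes_per_vertex
    return min(total, budget_pages * HUGE_PAGE_BYTES)


def advise_huge(region, length):
    """Hint that region[0:length] should use huge pages; False if unsupported."""
    option = getattr(mmap, "MADV_HUGEPAGE", None)  # Linux-only constant
    if option is None or length == 0:
        return False
    try:
        region.madvise(option, 0, length)
    except (OSError, ValueError):
        return False  # e.g. transparent huge pages disabled in the kernel
    return True


if __name__ == "__main__":
    edges = [(0, 3), (1, 3), (2, 3), (1, 2)]
    print(degree_order(edges, 4))  # vertex 3 has the highest degree, so it gets id 0
    region = mmap.mmap(-1, 4 * HUGE_PAGE_BYTES)  # anonymous mapping for properties
    print(advise_huge(region, hot_prefix_bytes(1_000_000, 8, budget_pages=2)))
```

After reordering, only the first few huge pages of the property array are requested, which mirrors the abstract's point that a small huge page budget should go to the data that benefits most. Whether the kernel honors the hint depends on system configuration and on how fragmented memory is.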
