Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUs

Graph analytics is widely used in applications such as recommender systems, scientific computing, and data mining, and GPUs have become the major accelerator for these workloads. However, graph sizes are growing rapidly and often exceed GPU memory capacity, which causes severe performance degradation due to frequent data transfers between main memory and the GPU. To relieve this problem, we focus on improving the utilization of data already resident in GPU memory by exploiting data reuse across iterations. We analyze the memory access patterns of graph applications at different granularities and find that the memory footprint is accessed in a roughly sequential scan without hotspots, which implies an extremely long reuse distance. Based on this observation, we propose a novel framework, called Liberator, to exploit data reuse within GPU memory. In Liberator, GPU memory is reserved for data that is potentially accessed across iterations, avoiding excessive data transfer between main memory and the GPU. For data not resident in GPU memory, a Merged and Aligned memory access scheme is employed to improve transfer efficiency. We further optimize the framework by processing the data in GPU memory and the data in main memory in parallel. We have implemented a prototype of Liberator and evaluated its performance in a series of experiments. The results show that Liberator significantly reduces data transfer overhead, achieving an average speedup of 2.7x over a state-of-the-art approach.
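The link between a sequential scan, a long reuse distance, and the value of reserving GPU memory can be illustrated with a toy model. The sketch below is not Liberator's implementation. It assumes a hypothetical graph split into `num_chunks` equally sized chunks that are scanned in order once per iteration, and a GPU that can hold `capacity` chunks. The helper names (`reuse_distances`, `lru_hits`, `pinned_hits`) are made up for illustration. The sketch compares a recency-based policy (LRU) with simply keeping a fixed subset of chunks resident.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """For each re-access, count the distinct chunks touched since the
    previous access to the same chunk."""
    last_pos = {}
    distances = []
    for i, chunk in enumerate(trace):
        if chunk in last_pos:
            distances.append(len(set(trace[last_pos[chunk] + 1:i])))
        last_pos[chunk] = i
    return distances


def lru_hits(trace, capacity):
    """Count hits for an LRU-managed device memory of `capacity` chunks."""
    cache = OrderedDict()
    hits = 0
    for chunk in trace:
        if chunk in cache:
            hits += 1
            cache.move_to_end(chunk)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[chunk] = True
    return hits


def pinned_hits(trace, capacity):
    """Count hits when the first `capacity` distinct chunks stay resident
    for the whole run and all other chunks are streamed on demand."""
    resident = set()
    hits = 0
    for chunk in trace:
        if chunk in resident:
            hits += 1
        elif len(resident) < capacity:
            resident.add(chunk)            # first touch: load and keep
    return hits


# Assumed toy workload: 8 chunks scanned sequentially for 4 iterations,
# with room for 5 chunks in device memory.
num_chunks, iterations, capacity = 8, 4, 5
trace = list(range(num_chunks)) * iterations

print("distinct reuse distances:", set(reuse_distances(trace)))
print("LRU hits:", lru_hits(trace, capacity))
print("pinned hits:", pinned_hits(trace, capacity))
```

In this toy workload every reuse distance equals `num_chunks - 1`, which exceeds the capacity, so LRU evicts each chunk just before it is needed again and never hits. Keeping a fixed subset resident serves that subset from device memory in every iteration after the first, which is the intuition behind reserving GPU memory for data reused across iterations.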
