Toward Fast and Scalable Random Walks over Disk-Resident Graphs via Efficient I/O Management

Traditional graph systems mainly use an iteration-based model, which iteratively loads graph blocks into memory for analysis to reduce random I/Os. However, this model limits the efficiency and scalability of random walks, a fundamental technique for analyzing large graphs. In this article, we first propose a state-aware I/O model to improve the I/O efficiency of random walks; we then develop a block-centric indexing and buffering scheme for managing walk data and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, GraphWalker, which efficiently handles very large disk-resident graphs and scales to tens of billions of random walks on a single commodity machine. Experiments show that GraphWalker achieves more than an order of magnitude speedup over DrunkardMob, a random walk system built on the classical graph system GraphChi, as well as over two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, compared with the recent distributed system KnightKing, GraphWalker achieves comparable performance with only a single machine, making it a more cost-effective alternative.
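The three ideas named above can be illustrated with a toy, in-memory sketch. It is not GraphWalker's implementation: the function `run_walks`, the fixed-size vertex-range blocks, and the "load the block holding the most pending walks" policy are assumptions made for illustration, and a block "load" here only increments a counter instead of reading from disk.

```python
import random
from collections import defaultdict


def run_walks(adj, block_size, sources, walk_len, seed=0):
    """Toy model of state-aware, block-centric, asynchronous random walks.

    adj        : list of neighbor lists, indexed by vertex id (hypothetical layout)
    block_size : number of consecutive vertex ids per block (assumed partitioning)
    sources    : start vertex of each walk
    walk_len   : number of hops each walk should take
    Returns (finished_walks, block_loads).
    """
    rng = random.Random(seed)

    def block_of(v):
        return v // block_size

    # Block-centric walk pool: walks are grouped by the block that holds
    # their current vertex, so all walks for a block are found in one place.
    pool = defaultdict(list)
    for s in sources:
        pool[block_of(s)].append((s, 0))  # (current vertex, hops taken)

    finished = []
    block_loads = 0
    while pool:
        # State-aware choice (assumed policy): load the block with the most
        # pending walks rather than sweeping blocks in a fixed order.
        b = max(pool, key=lambda k: len(pool[k]))
        walks = pool.pop(b)
        block_loads += 1  # stands in for one disk read of block b

        for v, hops in walks:
            done = False
            # Asynchronous updating: keep moving this walk for as long as it
            # stays inside the loaded block, instead of one hop per pass.
            while block_of(v) == b:
                if hops >= walk_len or not adj[v]:
                    done = True  # reached target length or a dead end
                    break
                v = rng.choice(adj[v])
                hops += 1
            if done or hops >= walk_len:
                finished.append((v, hops))
            else:
                # The walk left block b; park it under its new block.
                pool[block_of(v)].append((v, hops))

    return finished, block_loads
```

On a directed ring of 8 vertices split into two blocks of 4, three walks of length 5 finish after three block loads in this sketch, because each walk advances several hops per load instead of one hop per full pass over the graph.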
