Scaling Out Link Prediction with SNAPLE

A growing number of organizations are seeking to analyze extra large graphs in a timely and resource-efficient manner. With some graphs containing well over a billion elements, these organizations are turning to distributed graph-computing platforms that can scale out easily in existing data-centers and clouds. Unfortunately such platforms usually impose programming models that can be ill suited to typical graph computations, fundamentally undermining their potential benefits. In this paper, we consider how the emblematic problem of link-prediction can be implemented efficiently in gather-apply-scatter (GAS) platforms, a popular distributed graph-computation model. Our proposal, called Snaple, exploits a novel highly-localized vertex scoring technique, and minimizes the cost of data flow while maintaining prediction quality. When used within GraphLab, Snaple can scale to very large graphs that a standard implementation of link prediction on GraphLab cannot handle. More precisely, we show that Snaple can process a graph containing 1.4 billions edges on a 256 cores cluster in less than three minutes, with no penalty in the quality of predictions. This result corresponds to an over-linear speedup of 30 against a 20-core standalone machine running a non-distributed state-of-the-art solution.

[1]  Jimmy J. Lin,et al.  WTF: the who to follow service at Twitter , 2013, WWW.

[2]  M. Tamer Özsu,et al.  An Experimental Comparison of Pregel-like Graph Processing Systems , 2014, Proc. VLDB Endow..

[3]  Jon M. Kleinberg,et al.  Group formation in large social networks: membership, growth, and evolution , 2006, KDD '06.

[4]  Aapo Kyrola,et al.  DrunkardMob: billions of random walks on just a PC , 2013, RecSys.

[5]  Krishna P. Gummadi,et al.  Measurement and analysis of online social networks , 2007, IMC '07.

[6]  Nitesh V. Chawla,et al.  New perspectives and methods in link prediction , 2010, KDD.

[7]  David Liben-Nowell,et al.  The link-prediction problem for social networks , 2007 .

[8]  Ashish Goel,et al.  FAST-PPR: scaling personalized pagerank estimation for large graphs , 2014, KDD.

[9]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[10]  Rossano Schifanella,et al.  Folks in Folksonomies: social link prediction from shared metadata , 2010, WSDM '10.

[11]  Doina Caragea,et al.  Predicting Friendship Links in Social Networks Using a Topic Modeling Approach , 2011, PAKDD.

[12]  Matthew Rowe,et al.  Who Will Follow Whom? Exploiting Semantics for Link Prediction in Attention-Information Networks , 2012, SEMWEB.

[13]  Jonathan W. Berry,et al.  Challenges in Parallel Graph Processing , 2007, Parallel Process. Lett..

[14]  Dong Xin,et al.  Fast personalized PageRank on MapReduce , 2011, SIGMOD '11.

[15]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[16]  Joseph M. Hellerstein,et al.  Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[17]  Jure Leskovec,et al.  Friendship and mobility: user movement in location-based social networks , 2011, KDD.

[18]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[19]  Rudolf Eigenmann,et al.  Programming Distributed Memory Sytems Using OpenMP , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[20]  Panos Kalnis,et al.  Mizan: a system for dynamic load balancing in large-scale graph processing , 2013, EuroSys '13.

[21]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[22]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[23]  Reynold Xin,et al.  GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.

[24]  Purnamrita Sarkar,et al.  Fast nearest-neighbor search in disk-resident graphs , 2010, KDD.

[25]  Brian D. Davison,et al.  Structural link analysis and prediction in microblogs , 2011, CIKM '11.

[26]  Brian N. Bershad,et al.  The Midway distributed shared memory system , 1993, Digest of Papers. Compcon Spring.

[27]  András A. Benczúr,et al.  To randomize or not to randomize: space optimal summaries for hyperlink analysis , 2006, WWW '06.

[28]  Cecilia Mascolo,et al.  Exploiting place features in link prediction on location-based social networks , 2011, KDD.

[29]  Daniel M. Romero,et al.  Who Should I Follow? Recommending People in Directed Social Networks , 2011, ICWSM.

[30]  L. Takac DATA ANALYSIS IN PUBLIC SOCIAL NETWORKS , 2012 .

[31]  Mohammad Al Hasan,et al.  A Survey of Link Prediction in Social Networks , 2011, Social Network Data Analytics.

[32]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[33]  Kirk L. Johnson,et al.  CRL: high-performance all-software distributed shared memory , 1995, SOSP.

[34]  Jennifer Widom,et al.  GPS: a graph processing system , 2013, SSDBM.

[35]  Joseph Gonzalez,et al.  PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.

[36]  David A. Bader,et al.  STINGER: High performance data structure for streaming graphs , 2012, 2012 IEEE Conference on High Performance Extreme Computing.

[37]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[38]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[39]  Willy Zwaenepoel,et al.  X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.

[40]  Anne-Marie Kermarrec,et al.  Cheap and Cheerful: Trading Speed and Quality for Scalable Social-Recommenders , 2015, DAIS.