Dynamic interaction graphs with probabilistic edge decay

A large-scale network of social interactions, such as mentions on Twitter, can often be modeled as a “dynamic interaction graph” in which new interactions (edges) are continually added over time. Existing systems for extracting timely insights from such graphs are based on either a cumulative “snapshot” model or a “sliding window” model. The former model does not sufficiently emphasize recent interactions. The latter model abruptly forgets past interactions, leading to discontinuities in which, e.g., the graph analysis completely ignores historically important influencers who have temporarily gone dormant. We introduce TIDE, a distributed system for analyzing dynamic graphs that employs a new “probabilistic edge decay” (PED) model. In this model, the graph analysis algorithm of interest is applied at each time step to one or more graphs obtained as samples from the current “snapshot” graph, which comprises all interactions that have occurred so far. The probability that a given edge of the snapshot graph is included in a sample decays over time according to a user-specified decay function. The PED model allows controlled trade-offs between recency and continuity, and allows existing analysis algorithms for static graphs to be applied to dynamic graphs essentially without change. For the important class of exponential decay functions, we provide efficient methods that leverage past samples to incrementally generate new samples as time advances. We also exploit the large degree of overlap between samples to reduce memory consumption from O(N) to O(log N) when maintaining N sample graphs. Finally, we provide bulk-execution methods for applying graph algorithms to multiple sample graphs simultaneously without requiring any changes to existing graph-processing APIs. Experiments on a real Twitter dataset demonstrate the effectiveness and efficiency of our TIDE prototype, which is built on top of the Spark distributed computing framework.
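
To make the incremental sampling idea concrete, the following is a minimal single-machine sketch, not the TIDE implementation. It assumes an exponential decay function of the form f(a) = p^a, where a is the age of an edge in time steps and p is a hypothetical decay parameter in [0, 1]. Under that assumption, a sample for the next time step can be derived from the current one by keeping each sampled edge independently with probability p and then adding the newly arrived edges, so an edge of age a ends up included with probability p^a. The function names `advance_sample` and `simulate`, and the edge layout `(src, dst, arrival_time)`, are illustrative choices.

```python
import random


def advance_sample(sample, new_edges, p, rng):
    """Advance a PED sample by one time step (assumed decay f(a) = p**a).

    Each edge already in the sample survives with probability p;
    edges arriving at this step have age 0 and are always included.
    """
    # Sort before drawing so results are reproducible for a fixed seed.
    survivors = {edge for edge in sorted(sample) if rng.random() < p}
    survivors.update(new_edges)
    return survivors


def simulate(batches, p, seed=0):
    """Feed batches of edges (one batch per time step) through the sampler."""
    rng = random.Random(seed)
    sample = set()
    for batch in batches:
        sample = advance_sample(sample, batch, p, rng)
    return sample


if __name__ == "__main__":
    # Hypothetical interaction stream: (source, target, arrival time step).
    batches = [
        [("alice", "bob", 0), ("bob", "carol", 0)],
        [("carol", "dave", 1)],
        [("dave", "erin", 2)],
    ]
    print(sorted(simulate(batches, p=0.5, seed=42)))
```

Setting p = 1 in this sketch reproduces the cumulative snapshot model, and p = 0 keeps only the most recent batch, which behaves like a sliding window of length one; intermediate values trade recency against continuity.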
