Managing large dynamic graphs efficiently

There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications such as social networks, communication networks, and biological networks. Graph databases that reason explicitly about the graph structure of the data, and that support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, there is little work to date on understanding the challenges of distributed graph databases, which are needed to handle data at this scale. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale, dynamically changing graph and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph to support low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that it handles very large graphs efficiently and significantly reduces network bandwidth consumption.
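
To make the idea of a read-write-driven hybrid policy concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simple cost model in which eager replication costs one message per write at the owner and lazy replication costs one message per read at the remote site; the class `ReplicationMonitor`, its methods, and the labels `"eager"`, `"lazy"`, and `"none"` are all hypothetical names introduced for illustration.

```python
from collections import Counter


class ReplicationMonitor:
    """Hypothetical read/write counter for one graph partition.

    Assumed cost model (not taken from the paper): pushing every update
    costs one message per write, fetching on demand costs one message
    per read, so the cheaper mode is the one with the smaller count.
    """

    def __init__(self):
        self.reads = Counter()   # (node, site) -> reads issued at that site
        self.writes = Counter()  # node -> writes applied at the owning site

    def record_read(self, node, site):
        self.reads[(node, site)] += 1

    def record_write(self, node):
        self.writes[node] += 1

    def decide(self, node, site):
        """Return 'eager', 'lazy', or 'none' for replicating node at site."""
        r = self.reads[(node, site)]
        w = self.writes[node]
        if r == 0:
            return "none"          # nobody at this site reads the node
        return "eager" if r > w else "lazy"


if __name__ == "__main__":
    monitor = ReplicationMonitor()
    for _ in range(5):
        monitor.record_read("alice", site=1)
    for _ in range(2):
        monitor.record_write("alice")
    print(monitor.decide("alice", site=1))
```

Under this assumed model, a node that is read more often than it is written at a given site is pushed eagerly, while a write-heavy node is fetched lazily on demand. A real system would also need to age the counters so that decisions track a changing workload.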
