G2: A Graph Processing System for Diagnosing Distributed Systems

G2 is a graph processing system for diagnosing distributed systems. It works on execution graphs that model runtime events and their correlations in distributed systems. In G2, a diagnosis process involves a series of queries, expressed in a high-level declarative language that supports both relational and graph-based operators. Each query is compiled into a distributed execution. G2's execution engine supports both parallel relational data processing and iterative graph traversal. Execution graphs in G2 tend to have long paths and are structurally distinct from other large-scale graphs, such as social or web graphs. Tailored for execution graphs and for graph traversal operations on them, G2's graph engine distinguishes itself by embracing batched asynchronous iterations, which allow for better parallelism without barriers, and by enabling partition-level states and aggregation. We have applied G2 to the diagnosis of distributed systems such as Berkeley DB, SCOPE/Dryad, and G2 itself to validate its effectiveness. When co-deployed on a 60-machine cluster, G2's execution engine can handle execution graphs with millions of vertices and edges; for instance, using a single query in G2, we traverse, filter, and summarize a 130-million-vertex graph into a 12-thousand-vertex graph within 268 seconds on 60 machines. The use of an asynchronous model and a partition-level interface delivered a 66% reduction in response time when applied to queries in our diagnosis tasks.
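To make the idea of batched, barrier-free traversal with partition-level state more concrete, the following is a minimal single-process sketch in Python. It is not G2's engine or query language. The function name `traverse`, the `owner` placement map, the `batch_size` parameter, and the round-robin scheduling are all assumptions introduced here for illustration, and the sequential loop only simulates what would be concurrent workers in a real deployment.

```python
from collections import deque


def traverse(edges, owner, start, batch_size=2):
    """Reachability from `start` over a partitioned graph (illustrative only).

    edges: dict mapping a vertex to its list of successor vertices
    owner: dict mapping a vertex to a partition id (hypothetical placement)
    """
    partitions = sorted(set(owner.values()))
    inbox = {p: deque() for p in partitions}   # pending vertices per partition
    visited = {p: set() for p in partitions}   # partition-level state
    inbox[owner[start]].append(start)

    # No global superstep barrier: a partition with pending work processes
    # one batch whenever it is scheduled, without waiting for the others.
    while any(inbox.values()):
        for p in partitions:
            for _ in range(min(batch_size, len(inbox[p]))):
                v = inbox[p].popleft()
                if v in visited[p]:
                    continue
                visited[p].add(v)
                # Forward each successor to the partition that owns it.
                for w in edges.get(v, []):
                    inbox[owner[w]].append(w)

    # Partition-level aggregation: one summary value per partition.
    return {p: len(visited[p]) for p in partitions}
```

For a four-vertex chain `a -> b -> c -> d` split across two partitions, `traverse({"a": ["b"], "b": ["c"], "c": ["d"], "d": []}, {"a": 0, "b": 0, "c": 1, "d": 1}, "a")` returns one count per partition rather than one value per vertex, which is the kind of compact summary that partition-level aggregation is meant to produce.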
