Exposing Complex Bug-Triggering Conditions in Distributed Systems via Graph Mining

Software bugs in distributed systems are notoriously hard to find because of the large number of components involved and the non-determinism introduced by race conditions between messages. This paper introduces Pop Mine, a tool for diagnosing corner-case bugs by finding the minimal causal directed acyclic graph (DAG) of events, spanning multiple processes, that captures a bug-triggering condition. Because the approach relies on causal order, it does not require a global notion of time to uncover bug-triggering distributed event patterns. Bug-triggering event DAGs are identified by comparing execution graphs from successful runs with those from runs in which the bug manifested, and exposing the minimal discriminative event DAGs that may be responsible for the problem. This significantly extends prior debugging tools, which considered much simpler bug-triggering conditions such as single events, event sets, or ordered chains of events. To the authors' knowledge, this is the first paper that considers bug-triggering conditions in the form of distributed event graphs. To demonstrate the effectiveness of the approach, we applied the tool to VCP, Chord, and GreenGPS and diagnosed bugs. We also present performance analysis results that demonstrate the scalability of the approach.
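
The idea of comparing causal execution graphs from successful and failing runs can be illustrated with a minimal sketch. This is not the tool's algorithm: it considers only single causal edges (the smallest possible sub-DAGs) rather than mining general minimal DAGs, and the event format, the function names `causal_edges` and `discriminative_edges`, and the example traces are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical trace format: a run maps each process name to its ordered
# list of events, each event being (label, kind, msg_id) with kind in
# {"send", "recv", "local"}. No timestamps are used, only causal order.

def causal_edges(run):
    """Return the set of causal edges (src, dst) of one run.

    Edges come from program order (consecutive events on one process)
    and from messages (a send to its matching receive). Events with the
    same process and label collapse into one node, a simplification.
    """
    edges, sends, recvs = set(), {}, {}
    for proc, events in run.items():
        names = [f"{proc}.{label}" for label, _, _ in events]
        edges.update(zip(names, names[1:]))  # program-order edges
        for name, (_, kind, msg) in zip(names, events):
            if kind == "send":
                sends[msg] = name
            elif kind == "recv":
                recvs[msg] = name
    for msg, src in sends.items():
        if msg in recvs:
            edges.add((src, recvs[msg]))  # message edge
    return edges

def discriminative_edges(good_runs, bad_runs, min_support=1.0):
    """Edges seen in at least min_support of bad runs and in no good run."""
    good = set().union(*(causal_edges(r) for r in good_runs))
    counts = Counter(e for r in bad_runs for e in causal_edges(r))
    need = min_support * len(bad_runs)
    return sorted(e for e, c in counts.items() if c >= need and e not in good)

# Assumed example: process B fails only when C's join arrives before A's.
senders = {
    "A": [("send_join", "send", "a1")],
    "C": [("send_join", "send", "c1")],
}
good = dict(senders, B=[("recv_join_A", "recv", "a1"),
                        ("recv_join_C", "recv", "c1")])
bad = dict(senders, B=[("recv_join_C", "recv", "c1"),
                       ("recv_join_A", "recv", "a1")])

print(discriminative_edges([good], [bad]))
```

In this toy case the message edges are identical in both runs, so the only discriminative pattern is the ordering of the two receives on B, which is the kind of message race the abstract describes. Extending the comparison from single edges to multi-edge sub-DAGs is where a graph-mining approach would be needed.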
