ReproLite: A Lightweight Tool to Quickly Reproduce Hard System Bugs

Cloud systems have become ubiquitous: they store and process the tremendous amounts of data generated by Internet users. These systems run on hundreds of commodity machines and exhibit a huge amount of non-determinism in their execution, with thousands of threads and hundreds of processes. As a result, bugs in cloud systems are hard to understand, reproduce, and fix. The state of the art in industry debugging practice is to log messages during execution and to refer to those messages later when errors occur. ReproLite augments this already widespread log-based debugging process by enabling testers to quickly and easily specify the conjectures they form, from execution logs, about the cause of an error (or bug), and to validate those conjectures automatically. ReproLite includes a Domain Specific Language (DSL) that lets testers specify all aspects of a potential scenario that causes a given bug (e.g., specific workloads, execution operations and their orders, and environment non-determinism). Given such a scenario, ReproLite enforces the scenario's conditions during system execution. Potential buggy scenarios can also be generated automatically from a sequence of log messages that a tester believes indicates the cause of the bug. We have evaluated ReproLite on 11 bugs from two popular cloud systems, Cassandra and HBase, and were able to reproduce all of them. We report on our experience with using ReproLite on those bugs.
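To make the idea of a "scenario" concrete, the following Java sketch shows one way a conjectured buggy scenario could be represented as data: a driving workload, an enforced order of execution steps, and a piece of environment non-determinism (here, a crash) to inject. This is purely illustrative; the class and field names (BugScenarioSketch, BugScenario, Step) are invented for this example and do not reflect the actual ReproLite DSL, whose syntax is not shown in the abstract.

import java.util.List;

// Hypothetical sketch of what a ReproLite-style scenario might capture.
// All names here are invented for illustration; they are not the paper's DSL.
public class BugScenarioSketch {

    // One ordered step of the conjectured buggy execution.
    record Step(String node, String operation) {}

    // A scenario bundles a workload, an enforced order of steps,
    // and environment non-determinism (e.g., a crash) to inject.
    record BugScenario(String workload, List<Step> orderedSteps, Step faultToInject) {}

    public static void main(String[] args) {
        // Conjecture formed from log messages: the bug appears when a
        // region server crashes right after the master logs a region split,
        // while a concurrent write workload is running.
        BugScenario scenario = new BugScenario(
                "concurrent-writes",                        // workload to drive the system
                List.of(new Step("master", "log-region-split"),
                        new Step("regionserver-1", "flush-memstore")),
                new Step("regionserver-1", "crash"));       // non-determinism to enforce

        // A real tool would hook into the running system to enforce this order
        // and inject the fault; here we only print the conjecture to be checked.
        System.out.println("Enforce order: " + scenario.orderedSteps());
        System.out.println("Inject fault:  " + scenario.faultToInject());
    }
}

In such a design, a tester would translate a suspicious sequence of log messages into a scenario like the one above and hand it to the tool, which then drives the workload, enforces the step order, and injects the fault to check whether the conjectured cause actually reproduces the bug.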
