Trace aware random testing for distributed systems

Distributed and concurrent applications often have subtle bugs that are exposed only under specific schedules. While these schedules may be found by systematic model checking techniques, in practice model checkers do not scale to large systems. On the other hand, naive random exploration techniques often require a very large number of runs to find the specific interactions needed to expose a bug. In recent years, several random testing algorithms have been proposed that, on the one hand, exploit state-space reduction strategies from model checking and, on the other, provide guarantees on the probability of hitting bugs of certain kinds. These existing techniques exploit two orthogonal strategies to reduce the state space: partial-order reduction and bug depth. Testing algorithms based on partial-order techniques, such as RAPOS or POS, avoid redundant exploration of interleavings of independent system events by imposing an equivalence relation on schedules and ideally exploring only one schedule from each equivalence class. Techniques based on bug depth, such as PCT, exploit the empirical observation that many bugs are exposed by the clever scheduling of a small number of key events. They bias the sample space of schedules toward executions of small depth, rather than the much larger space of all schedules. To date, no random testing algorithm combines the power of both approaches. In this paper, we provide such an algorithm. Our algorithm, trace-aware PCT (taPCTCP), extends and unifies several different algorithms in the random testing literature. It samples the space of low-depth executions by constructing a schedule online, while taking dependencies among events into account. Moreover, the algorithm comes with a theoretical guarantee on the probability of sampling a trace of low depth: the bound on this probability decays exponentially with the depth but only polynomially with the number of racy events explored. We further show that the guarantee is optimal among a large class of techniques. We empirically compare our algorithm with several state-of-the-art random testing approaches for concurrent software on two large-scale distributed systems, Zookeeper and Cassandra, and show that our approach is effective in uncovering subtle bugs and usually outperforms related random testing algorithms.
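
To make the bug-depth strategy concrete, the following is a minimal sketch of a PCT-style priority scheduler, not of taPCTCP itself. It makes several simplifying assumptions that are not taken from the paper: each thread is a fixed list of event labels, every unfinished thread is always enabled, and the program has at least one event. The function name `pct_schedule` and its parameters are hypothetical. The idea illustrated is that threads receive random distinct priorities, the highest-priority enabled thread always runs, and at `depth - 1` randomly chosen steps the running thread's priority is lowered below all initial priorities.

```python
import random


def pct_schedule(threads, depth, rng):
    """Return one schedule, as a list of (thread, event) pairs, in the style of PCT.

    threads: dict mapping a thread name to its list of events in program order
             (assumed non-empty overall; blocking is not modeled).
    depth:   the targeted bug depth d; d - 1 priority change points are used.
    rng:     a random.Random instance, passed in for reproducibility.
    """
    names = list(threads)
    total_steps = sum(len(events) for events in threads.values())

    # Distinct random initial priorities, all >= depth, so that the lowered
    # priorities 1 .. depth-1 always rank below every initial priority.
    initial = list(range(depth, depth + len(names)))
    rng.shuffle(initial)
    priority = dict(zip(names, initial))

    # depth - 1 change points, drawn uniformly from the step indices.
    change_points = [rng.randint(1, total_steps) for _ in range(depth - 1)]

    position = {name: 0 for name in names}
    schedule = []
    for step in range(1, total_steps + 1):
        enabled = [n for n in names if position[n] < len(threads[n])]
        current = max(enabled, key=priority.get)
        schedule.append((current, threads[current][position[current]]))
        position[current] += 1
        # At the i-th change point, demote the thread that just ran to priority i.
        for i, point in enumerate(change_points, start=1):
            if point == step:
                priority[current] = i
    return schedule


if __name__ == "__main__":
    example = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
    print(pct_schedule(example, depth=2, rng=random.Random(0)))
```

With `depth=1` there are no change points, so each thread runs to completion in priority order. Larger depths allow a bounded number of preemptions at random positions, which is the bias toward low-depth executions described above. This sketch ignores dependencies among events, which is the part a trace-aware scheduler would add.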
