Arthur: Rich Post-Facto Debugging for Production Analytics Applications

Debugging the massively parallel computations that run in today’s datacenters is hard, as they consist of thousands of tasks processing terabytes of data. It is especially hard in production settings, where performance overheads of more than a few percent are unacceptable. To address this challenge, we present Arthur, a new debugger that provides a rich set of analysis tools at close to zero runtime overhead through selective replay of data flow applications. Unlike previous replay debuggers, which incur high overheads because they must log low-level nondeterministic events, Arthur takes advantage of the structure of large-scale data flow models (e.g., MapReduce), which split work into deterministic tasks for fault tolerance, to minimize its logging cost. We use selective replay to implement a variety of debugging features, including rerunning any task in a single-process debugger; ad-hoc queries on computation state; and forward and backward tracing of records through the computation, which we achieve using a program transformation at replay time. We implement Arthur for Hadoop and Spark, and show that it can be used to find a variety of real bugs.
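
As a rough illustration of the selective replay idea, the sketch below models a data flow job as a graph of deterministic tasks and recomputes only the tasks that a chosen target depends on. It is a minimal toy under stated assumptions, not Arthur's implementation: the graph layout, the task names, and the helpers `ancestors` and `selective_replay` are all hypothetical and do not correspond to any Hadoop or Spark API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task graph: name -> (parent task names, deterministic function).
# Because each task is a pure function of its parents' outputs, rerunning it
# later yields the same result without logging low-level events.
TaskGraph = Dict[str, Tuple[List[str], Callable[..., list]]]


def ancestors(graph: TaskGraph, target: str) -> List[str]:
    """Return the tasks needed to recompute `target`, parents before children."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for parent in graph[name][0]:
            visit(parent)
        order.append(name)

    visit(target)
    return order


def selective_replay(graph: TaskGraph, target: str) -> list:
    """Rerun only the tasks that `target` depends on and return its output."""
    outputs: Dict[str, list] = {}
    for name in ancestors(graph, target):
        parents, fn = graph[name]
        outputs[name] = fn(*(outputs[p] for p in parents))
    return outputs[target]


# Toy job (illustrative data only): a word count plus an unrelated branch.
graph: TaskGraph = {
    "load": ([], lambda: ["a b", "b c", "c"]),
    "split": (["load"], lambda lines: [w for line in lines for w in line.split()]),
    "count": (["split"], lambda words: sorted({w: words.count(w) for w in words}.items())),
    "lengths": (["load"], lambda lines: [len(line) for line in lines]),
}

replayed = selective_replay(graph, "count")
```

Replaying `count` here reruns `load`, `split`, and `count` but skips `lengths`, which is the sense in which the replay is selective. In a real system the replayed task could then be run inside a single-process debugger.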
