Arming Cloud Services with Task Aspects

Our cloud services are losing too many battles to faults like software bugs, resource interference, and hardware failures. Many tools can help us win these battles: model checkers to verify, fault injection to find bugs, replay to debug, and many more. Unfortunately, tools are currently afterthoughts in cloud service design: they must either be tediously tangled into service implementations or integrated transparently in ways that fail to capture the service’s problematic non-deterministic (concurrent, asynchronous, and resource-access) behavior. This paper makes tooling a first-class concern by encoding services as tasks whose interactions reliably capture all of the non-deterministic behavior that tools need. Task interactions are then exposed through aspects, which are well suited to encoding cross-cutting behavior. Combined, tools encoded as task aspects can integrate with services both effectively and transparently. We show how task aspects ease the development of an online production data service that runs on a hundred machines.
