The Difference Engine

All systems are a product of their histories. Events over time shape the state of OS and application software, for good and for bad: while users make forward progress on productive work, bugs and malicious software may destabilize and corrupt their efforts. The notion of treating system state as mutable through time has been the subject of many recent projects. For example, efforts have considered replaying slight permutations of recent events to recover from timing-related crashes[5], revisiting historical system states to identify the introduction of configuration errors[6], and rewinding execution to assist in debugging[2]. In all of these examples, revisiting (and in some cases even modifying) history allows the exploration of a system’s state space, of which the “current” incarnation is but one instance.

We believe that there is broad benefit in providing general techniques that allow more thorough exploration and analysis of the execution states of a given system. We are developing a tool to assist this exploration, which we have dubbed the Difference Engine.

The difference engine is primarily concerned with understanding the divergence between alternate states of a given system. For example, we may choose to create an alternate instance of a desktop OS in which a malicious network packet had never arrived, but for which the remainder of history had proceeded identically. The engine allows the controlled creation, and thorough analysis, of such divergence. It exploits the ability of virtual machines (VMs) to be checkpointed, rolled back, and replayed to create alternate but plausible new outcomes. This set of outcomes, each represented by an independent VM instance, can be viewed as parallel universes where history occurred in subtly different ways.
In the case of the malicious packet just mentioned, the engine allows us to consider a large number of alternate universes that might result from the packet’s delivery being prevented, and to build insight into the specific mutations that its arrival induced.

The operation of the difference engine involves two phases: generation and analysis. In the generation phase, the difference engine replays logged external events against a historical version of a virtual machine. Replay is intentionally non-deterministic, and may be parameterized to modify the stream of events that is delivered. In the analysis phase, the engine provides tools to assist with semantic comparisons between the resulting alternate states. These two phases are illustrated in Figure 1.

We are currently developing the difference engine as a tool based on the Xen virtual machine monitor. To date, we have had to grapple with two fundamental challenges: non-determinism of replay, and system semantics. These issues are closely related and introduce interesting obstacles in both the replay and analysis phases. Non-determinism is clearly necessary in order for replay to explore alternate states, but it demands that the replay support be tolerant of externally visible permutations of a system that impact the event log. Similarly, presenting a meaningful account of the divergence between a set of alternate instances requires sufficient semantic comprehension of a system’s state to recognize and summarize its differences.

We now discuss some specific challenges that non-determinism and semantics present in the replay and analysis phases, and then provide an overview of some example applications. Overall, we believe that the difference engine is a broadly useful tool for assisting in the exploration of “What if...” questions for large and complex software systems.
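
As a rough illustration of the two phases described above, the following Python sketch replays a logged event stream from a checkpoint, once unmodified and once with a single event suppressed, and then compares the two outcomes. It is a toy model under stated assumptions, not the engine itself: the dictionary standing in for VM state, the tuple-shaped events, and the names `replay`, `diff`, and `drop_malicious` are all hypothetical, and nothing here uses Xen. Unlike the replay described above, this toy replay is deterministic; it only shows how a parameterized event stream yields an alternate state that can be compared with the original.

```python
import copy
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical toy model: "VM state" is a dict of named values and a logged
# event is a (kind, key, value) tuple. These shapes are assumptions made for
# illustration only.
State = Dict[str, str]
Event = Tuple[str, str, str]


def apply_event(state: State, event: Event) -> None:
    """Apply one logged event to the state in place."""
    kind, key, value = event
    if kind == "write":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)


def replay(checkpoint: State, log: Iterable[Event],
           mutate: Callable[[Event], Optional[Event]] = lambda e: e) -> State:
    """Generation phase: replay a (possibly modified) log from a checkpoint.

    `mutate` may return the event unchanged, a replacement event, or None to
    suppress delivery. The checkpoint itself is never modified.
    """
    state = copy.deepcopy(checkpoint)
    for event in log:
        delivered = mutate(event)
        if delivered is not None:
            apply_event(state, delivered)
    return state


def diff(a: State, b: State) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Analysis phase: report every key whose value differs between a and b."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


checkpoint: State = {"hosts": "clean"}
log = [
    ("write", "doc.txt", "draft 1"),
    ("write", "hosts", "hijacked"),   # stands in for the malicious packet
    ("write", "doc.txt", "draft 2"),
]


def drop_malicious(event: Event) -> Optional[Event]:
    """Suppress only the event that models the malicious packet."""
    return None if event == ("write", "hosts", "hijacked") else event


actual = replay(checkpoint, log)                     # history as it happened
alternate = replay(checkpoint, log, drop_malicious)  # packet never arrived
divergence = diff(actual, alternate)
print(divergence)  # {'hosts': ('hijacked', 'clean')}
```

In this toy setting the user's productive work (`doc.txt`) is identical in both outcomes, and the divergence isolates the single mutation caused by the suppressed event. A real system would have to cope with the non-determinism and semantic issues discussed above, which this sketch deliberately ignores.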