Fault-tolerant transaction architectures

A data store that is split into multiple partitions (also called shards) and implements atomic transactions must coordinate any transaction that spans more than one partition. Each partition is represented by an Object Manager (OM). Users access the system through Transaction Managers (TMs), which export the transaction API (start-transaction, read/write, end-transaction) and initiate this coordination to certify the transaction, that is, to decide whether it commits or aborts. This simplified structure is illustrated in Figure 1. Coordination becomes a challenge in the face of message loss and machine crashes, faults that are the norm in today’s large-scale systems. We review here several contemporary architectures, discuss the tradeoffs among them, and compare them through simulation.
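
To make the TM/OM structure concrete, here is a minimal, failure-free sketch in Python. It is an illustration under stated assumptions, not one of the architectures reviewed here: the class and method names, the hash-based placement of keys on partitions, and the version-check certification rule are all hypothetical, and the sketch deliberately ignores message loss and crashes, which are the hard part.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ObjectManager:
    """One partition (shard). Stores key -> (version, value)."""
    store: dict = field(default_factory=dict)

    def read(self, key):
        # Unwritten keys are reported as version 0 with no value.
        return self.store.get(key, (0, None))

    def validate(self, read_versions):
        # Assumed certification rule: vote "commit" only if every version
        # the transaction observed on this partition is still current.
        return all(self.read(k)[0] == v for k, v in read_versions.items())

    def apply(self, writes):
        for key, value in writes.items():
            version, _ = self.read(key)
            self.store[key] = (version + 1, value)


@dataclass
class Transaction:
    reads: dict = field(default_factory=dict)   # key -> version observed
    writes: dict = field(default_factory=dict)  # key -> buffered new value


class TransactionManager:
    """Exports start-transaction, read/write, end-transaction."""

    def __init__(self, oms):
        self.oms = oms

    def _om_for(self, key):
        # Hypothetical placement rule: hash partitioning over the OMs.
        return self.oms[zlib.crc32(key.encode()) % len(self.oms)]

    def start_transaction(self):
        return Transaction()

    def read(self, txn, key):
        if key in txn.writes:  # read your own buffered write
            return txn.writes[key]
        version, value = self._om_for(key).read(key)
        txn.reads.setdefault(key, version)
        return value

    def write(self, txn, key, value):
        txn.writes[key] = value  # buffered until end_transaction

    def end_transaction(self, txn):
        # Certify: collect a vote from every partition, commit only if all agree.
        votes = [
            om.validate({k: v for k, v in txn.reads.items()
                         if self._om_for(k) is om})
            for om in self.oms
        ]
        if not all(votes):
            return False  # abort: nothing is applied
        for om in self.oms:
            om.apply({k: v for k, v in txn.writes.items()
                      if self._om_for(k) is om})
        return True  # commit
```

In this sketch a transaction that read a key later overwritten by a concurrently committed transaction is aborted at end-transaction. The validate and apply steps run as plain local calls here; in a real deployment they are messages between machines, which is exactly where lost messages and crashed TMs or OMs make the commit/abort decision difficult.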