Millions of Tiny Databases

Starting in 2013, we set out to build a new database to act as the configuration store for a high-performance cloud block storage system (Amazon EBS). This database needs to be not only highly available, durable, and scalable but also strongly consistent. We quickly realized that the constraints on availability imposed by the CAP theorem, and the realities of operating distributed systems, meant that we didn’t want one database. We wanted millions. Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients. Physalia uses its knowledge of datacenter topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client. This paper describes Physalia in the context of Amazon EBS, and some other uses within Amazon Web Services. We believe that the same patterns, and approach to design, are widely applicable to distributed systems problems like control planes, configuration management, and service discovery.
