Surviving Congestion in Geo-Distributed Storage Systems

We present Vivace, a key-value storage system for web applications that span many geographically distributed sites. Vivace provides strong consistency and replicates data across sites for access locality and disaster tolerance. Vivace is designed to cope well with network congestion across sites, which occurs because bandwidth across sites is lower than bandwidth within sites. Vivace relies on two novel algorithms that prioritize a small amount of critical data to avoid delays due to congestion. We evaluate Vivace to show its feasibility and effectiveness.
