Scalable Error Isolation for Distributed Systems

In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.
