Paper Information - Scalable Error Isolation for Distributed Systems

Scalable Error Isolation for Distributed Systems

In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption. In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.
