Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Yu Luo | Ding Yuan | Xu Zhao | Pranay Jain | Michael Stumm | Yongle Zhang | Xin Zhuang | Guilherme Renna Rodrigues
[1] Dawson R. Engler, et al. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, 2008, OSDI.
[2] Yuanyuan Zhou, et al. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, 2004, OSDI.
[3] Wei Lin, et al. A characteristic study on failures of production distributed data-parallel programs, 2013, 2013 35th International Conference on Software Engineering (ICSE).
[4] Junfeng Yang, et al. An empirical study of operating systems errors, 2001, SOSP.
[5] Dawson R. Engler, et al. Checking system rules using system-specific, programmer-written compiler extensions, 2000, OSDI.
[6] Ratul Mahajan, et al. Understanding BGP misconfiguration, 2002, SIGCOMM '02.
[7] Jim Gray, et al. Why Do Computers Stop and What Can Be Done About It?, 1986, Symposium on Reliability in Distributed Software and Database Systems.
[8] Andrea C. Arpaci-Dusseau, et al. A Study of Linux File System Evolution, 2013, FAST.
[9] Junfeng Yang, et al. Practical software model checking via dynamic interface reduction, 2011, SOSP.
[10] Joshua J. Bloch. Effective Java, 2nd Edition, 2008, The Java Series ... from the Source.
[11] Andrea C. Arpaci-Dusseau, et al. Error propagation analysis for file systems, 2009, PLDI '09.
[12] Chen Fu, et al. Exception-Chain Analysis: Revealing Exception Handling Architecture in Java Server Applications, 2007, 29th International Conference on Software Engineering (ICSE'07).
[13] Randy H. Katz, et al. How Hadoop Clusters Break, 2013, IEEE Software.
[14] Junfeng Yang, et al. Parrot: a practical runtime for deterministic, stable, and reliable threads, 2013, SOSP.
[15] Yuriy Brun, et al. Mining temporal invariants from partially ordered logs, 2011, OPSR.
[16] Jason Nieh, et al. Transparent, lightweight application execution replay on commodity multiprocessor operating systems, 2010, SIGMETRICS '10.
[17] Christophe Calvès, et al. Faults in Linux: ten years later, 2011, ASPLOS XVI.
[18] Luis Ceze, et al. Deterministic Process Groups in dOS, 2010, OSDI.
[19] Pallavi Joshi, et al. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems, 2014, OSDI.
[20] Josep Torrellas, et al. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently, 2008, International Symposium on Computer Architecture.
[21] Mark Sullivan, et al. Software defects and their impact on system availability: a study of field failures in operating systems, 1991, Digest of Papers, Fault-Tolerant Computing: The Twenty-First International Symposium.
[22] Jean-Claude Laprie, et al. Dependable computing: concepts, limits, challenges, 1995.
[23] Michael I. Jordan, et al. Detecting large-scale system problems by mining console logs, 2009, SOSP '09.
[24] Xuezheng Liu, et al. R2: An Application-Level Kernel for Record and Replay, 2008, 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
[25] Xiao Ma, et al. An empirical study on configuration errors in commercial and open source systems, 2011, SOSP.
[26] Haoxiang Lin, et al. MODIST: Transparent Model Checking of Unmodified Distributed Systems, 2009, NSDI.
[27] Amin Vahdat, et al. Life, death, and the critical transition: finding liveness bugs in systems code, 2007.
[28] Yuanyuan Zhou, et al. Do not blame users for misconfigurations, 2013, SOSP.
[29] Jason Nieh, et al. Transparent, lightweight application execution replay on commodity multiprocessor operating systems, 2010.
[30] Satish Narayanasamy, et al. DoublePlay: parallelizing sequential logging and replay, 2011, ASPLOS XVI.
[31] Nick Feamster, et al. Detecting BGP configuration faults with static analysis, 2005.
[32] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[33] George Candea, et al. The S2E Platform: Design, Implementation, and Applications, 2012, TOCS.
[34] Archana Ganapathi, et al. Why Do Internet Services Fail, and What Can Be Done About It?, 2002, USENIX Symposium on Internet Technologies and Systems.
[35] Yang Liu, et al. Be conservative: enhancing failure diagnosis with proactive logging, 2012, OSDI.
[36] Jennifer Neville, et al. Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems, 2012, NSDI.
[37] George Candea, et al. Efficient Testing of Recovery Code Using Fault Injection, 2011, TOCS.
[38] Kashi Venkatesh Vishwanath, et al. Characterizing cloud computing hardware reliability, 2010, SoCC '10.
[39] Lorenzo Keller, et al. ConfErr: A tool for assessing resilience to human configuration errors, 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[40] Josep Torrellas, et al. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently, 2008, International Symposium on Computer Architecture.
[41] Andrea C. Arpaci-Dusseau, et al. EIO: Error Handling is Occasionally Correct, 2008, FAST.
[42] Richard P. Martin, et al. Understanding and Dealing with Operator Mistakes in Internet Services, 2004, OSDI.
[43] Yuriy Brun, et al. Unifying FSM-inference algorithms through declarative specification, 2013, 2013 35th International Conference on Software Engineering (ICSE).
[44] Radu Banabic, et al. An Extensible Technique for High-Precision Testing of Recovery Code, 2010, USENIX Annual Technical Conference.
[45] Paramvir Bahl, et al. Detailed diagnosis in enterprise networks, 2009, SIGCOMM '09.
[46] Samuel T. King, et al. ReVirt: enabling intrusion analysis through virtual-machine logging and replay, 2002, OPSR.
[47] Yuanyuan Zhou, et al. Learning from mistakes: a comprehensive study on real world concurrency bug characteristics, 2008, ASPLOS.
[48] Andrea C. Arpaci-Dusseau, et al. FATE and DESTINI: A Framework for Cloud Recovery Testing, 2011, NSDI.