An Analysis of Network-Partitioning Failures in Cloud Systems
Samer Al-Kiswany | Ahmed Alquraan | Mohammed Alfatafta | Hatem Takruri
[1] Marvin Theimer,et al. Managing update conflicts in Bayou, a weakly connected replicated storage system , 1995, SOSP.
[2] Ju Wang,et al. Windows Azure Storage: a highly available cloud storage service with strong consistency , 2011, SOSP.
[3] Tanakorn Leesatapornwongsa,et al. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems , 2014, SoCC.
[4] Stefan Savage,et al. California fault lines: understanding the causes and impact of network failures , 2010, SIGCOMM '10.
[5] Serdar Tasiran,et al. VYRD: verifYing concurrent programs by runtime refinement-violation detection , 2005, PLDI '05.
[6] Joseph M. Hellerstein,et al. Lineage-driven Fault Injection , 2015, SIGMOD Conference.
[7] Dutch T. Meyer,et al. Remus: High Availability via Asynchronous Virtual Machine Replication , 2008, NSDI.
[8] Brian F. Cooper. Spanner: Google's globally-distributed database , 2013, SYSTOR '13.
[9] Pallavi Joshi,et al. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems , 2014, OSDI.
[10] John K. Ousterhout,et al. In Search of an Understandable Consensus Algorithm , 2014, USENIX ATC.
[11] Xin Chen,et al. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study , 2014, 2014 IEEE 25th International Symposium on Software Reliability Engineering.
[12] Robert Birke,et al. Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[13] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput.
[14] Haryadi S. Gunawi,et al. Why Does the Cloud Stop Computing?: Lessons from Hundreds of Service Outages , 2016, SoCC.
[15] Jie Xu,et al. An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment , 2014, 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering.
[16] Arkady Kanevsky,et al. Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics , 2008, TOS.
[17] Archana Ganapathi,et al. Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.
[18] Ethan Katz-Bassett,et al. SPANStore: cost-effective geo-replicated storage spanning multiple cloud services , 2013, SOSP.
[19] Leslie Lamport,et al. Paxos Made Simple , 2001 .
[20] Hairong Kuang,et al. The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).
[21] Jay Kreps,et al. Kafka: a Distributed Messaging System for Log Processing , 2011 .
[22] Haoxiang Lin,et al. MODIST: Transparent Model Checking of Unmodified Distributed Systems , 2009, NSDI.
[23] Van-Anh Truong,et al. Availability in Globally Distributed Storage Systems , 2010, OSDI.
[24] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[25] Thomas R. Gross,et al. Automatic testing of sequential and concurrent substitutability , 2013, 2013 35th International Conference on Software Engineering (ICSE).
[26] Nancy A. Lynch,et al. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services , 2002, SIGACT News.
[27] Scott Shenker,et al. Spark: Cluster Computing with Working Sets , 2010, HotCloud.
[28] Kashi Venkatesh Vishwanath,et al. Characterizing cloud computing hardware reliability , 2010, SoCC '10.
[29] Werner Vogels,et al. Dynamo: Amazon's highly available key-value store , 2007, SOSP.
[30] Randy H. Katz,et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.
[31] Haoxiang Lin,et al. An Empirical Study on Quality Issues of Production Big Data Platform , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.
[32] Rupak Majumdar,et al. Why is random testing effective for partition tolerance bugs? , 2017, Proc. ACM Program. Lang.
[33] Min Zhu,et al. B4: experience with a globally-deployed software defined WAN , 2013, SIGCOMM.
[34] Ramesh Govindan,et al. Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure , 2016, SIGCOMM.
[35] Anees Shaikh,et al. A First Look at Problems in the Cloud , 2010, HotCloud.
[36] Emin Gün Sirer,et al. Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays , 2004, NSDI.
[37] Eric A. Brewer,et al. Lessons from Giant-Scale Services , 2001, IEEE Internet Comput.
[38] Bianca Schroeder,et al. Reading between the lines of failure logs: Understanding how HPC systems fail , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[39] Andrea C. Arpaci-Dusseau,et al. FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.
[40] Marcos K. Aguilera,et al. Consistency-based service level agreements for cloud storage , 2013, SOSP.
[41] Hans-Arno Jacobsen,et al. PNUTS: Yahoo!'s hosted data serving platform , 2008, Proc. VLDB Endow.
[42] Navendu Jain,et al. Understanding network failures in data centers: measurement, analysis, and implications , 2011, SIGCOMM.
[43] Carlos Maltzahn,et al. Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.
[44] Yu Luo,et al. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems , 2014, OSDI.
[45] Patrice Godefroid,et al. Model checking for programming languages using VeriSoft , 1997, POPL '97.
[46] S. Savage,et al. On Failure in Managed Enterprise Networks , 2012 .
[47] Randy H. Katz,et al. How Hadoop Clusters Break , 2013, IEEE Software.
[48] Robert Griesemer,et al. Paxos made live: an engineering perspective , 2007, PODC '07.
[49] Wei Lin,et al. A characteristic study on failures of production distributed data-parallel programs , 2013, 2013 35th International Conference on Software Engineering (ICSE).
[50] Hui Ding,et al. TAO: Facebook's Distributed Data Store for the Social Graph , 2013, USENIX Annual Technical Conference.
[51] Christel Baier,et al. Principles of model checking , 2008 .
[52] Flavio Paiva Junqueira,et al. Zab: High-performance broadcast for primary-backup systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[53] Dinghao Wu,et al. KISS: keep it simple and sequential , 2004, PLDI '04.
[54] Kent L. Beck,et al. Test-driven Development - by example , 2002, The Addison-Wesley signature series.