The Case for Limping-Hardware Tolerant Clouds

With the advent of cloud computing, thousands of machines are connected and managed collectively. This era brings a new challenge: performance variability, caused largely by problems of scale such as hardware failures, software bugs, and configuration mistakes. In this paper, we highlight one overlooked cause: limping hardware – hardware whose performance degrades significantly relative to its specification. We present numerous cases of limping disks, network hardware, and processors observed in production, along with the negative impacts such failures have on existing large-scale distributed systems. Based on these findings, we advocate the concept of limping-hardware tolerant clouds.
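
Since "limping" is defined relative to a device's specification, the idea can be made concrete with a simple probe. The Python sketch below is a minimal, hypothetical illustration (not from the paper): it measures a disk's sequential-write throughput and flags the device as limping when the observed rate falls far below an assumed spec value. The spec figure, the 50% threshold, and the probe path are illustrative assumptions.

    # Hypothetical sketch: flag a "limping" disk by comparing measured
    # sequential-write throughput against an assumed specified throughput.
    import os
    import time

    SPEC_MBPS = 100.0      # assumed vendor-specified sequential write throughput
    LIMP_THRESHOLD = 0.5   # assumed: flag the disk if below 50% of spec

    def measure_write_mbps(path, size_mb=64):
        """Write size_mb of zeros to path and return observed throughput in MB/s."""
        chunk = b"\0" * (1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start
        return size_mb / elapsed

    if __name__ == "__main__":
        observed = measure_write_mbps("/tmp/limp_probe.bin")
        if observed < LIMP_THRESHOLD * SPEC_MBPS:
            print(f"possible limping disk: {observed:.1f} MB/s vs spec {SPEC_MBPS:.1f} MB/s")
        else:
            print(f"disk within expected range: {observed:.1f} MB/s")

A production monitor would of course use longer-running, less intrusive measurements and compare against per-device baselines rather than a single hard-coded spec, but the comparison against expected performance is the essence of detecting limping hardware.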
