Increasing data center infrastructure uptime, reducing unplanned downtime, and maintaining data integrity are increasingly critical in today’s real-time, service-level agreement (SLA)-driven cloud service business environment. Servers, as the backbone of cloud computing, have been evolving to meet the diverse challenges of preserving data integrity, increasing availability, and minimizing planned downtime, especially in high temperature ambient (HTA) data center environments. Robust server system design for reliability, availability, and serviceability (RAS) is therefore crucial for cloud service providers. Memory errors in servers are among the most common hardware causes of machine crashes at production sites with large-scale systems, and the higher the ambient temperature, the more errors occur. The typical response to a memory failure is to replace the affected memory modules, which makes memory modules among the most commonly replaced server components. Memory failures and their correction are therefore very costly. Based on data collection and failure analysis from the Baidu infrastructure maintenance group, server system memory failure (uncorrectable error) ranks first in failure rate in the data center. To reduce the memory failure rate and the related server downtime, Baidu developed an advanced server that handles correctable and uncorrectable memory errors throughout a complete "6 pillars" application stack, from the underlying hardware to the scheduling system. Such solutions involve three components: (1) reliability, how the solution preserves data integrity; (2) availability, how it guarantees uninterrupted operation with minimal degradation; and (3) serviceability, how it simplifies proactively and reactively dealing with failed or potentially failed components. 
Availability is not an independent vector. This paper addresses Baidu's rack server memory RAS architecture and design. Its scope ranges from Intel Xeon processor RAS features for hardware error avoidance, detection, and correction, which improve system reliability and fault tolerance; through failure identification and reconfiguration, such as leaky-bucket-based software-enhanced error recovery and error containment; to kernel-level page retirement and a high-availability scheduler. Memory RAS system design specific to the HTA data center environment is also introduced in detail. The related lab test procedures, data, and observations are summarized at the end of each section, and the overall conclusions and future work plan are summarized at the end of the paper.
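The leaky-bucket ("leakage bucket") error handling mentioned above can be illustrated with a minimal sketch. The paper's actual implementation is not described in this abstract, so the class name `LeakyBucket`, the `threshold` and `leak_interval_s` parameters, and their default values are all hypothetical. The idea shown is the generic one: each correctable error adds one unit to a counter, the counter drains at a fixed rate, and an overflow signals that an action such as page retirement may be warranted.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Leaky-bucket counter for correctable memory errors (illustrative only)."""

    threshold: int = 10              # hypothetical: errors tolerated before overflow
    leak_interval_s: float = 3600.0  # hypothetical: one error drains per interval
    level: int = 0                   # current number of errors held in the bucket
    last_leak_s: float = 0.0         # time of the last accounted drain step

    def record_error(self, now_s: float) -> bool:
        """Record one correctable error at time now_s.

        Returns True when the bucket level exceeds the threshold.
        """
        # Drain one unit for every full leak interval that has elapsed.
        elapsed = int((now_s - self.last_leak_s) // self.leak_interval_s)
        if elapsed > 0:
            self.level = max(0, self.level - elapsed)
            self.last_leak_s += elapsed * self.leak_interval_s
        # Add the new error and check for overflow.
        self.level += 1
        return self.level > self.threshold


# A burst of errors overflows the bucket; sparse errors drain away.
burst = LeakyBucket(threshold=3, leak_interval_s=10.0)
burst_results = [burst.record_error(t) for t in (0.0, 1.0, 2.0, 3.0)]

sparse = LeakyBucket(threshold=3, leak_interval_s=10.0)
sparse_results = [sparse.record_error(t) for t in (0.0, 20.0, 40.0, 60.0)]
```

With these assumed settings, a burst of errors within one leak interval trips the threshold, while the same number of errors spread over several intervals does not, which is the property that lets such a counter separate a degrading memory region from isolated transient errors.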