Measuring the Impact of Memory Errors on Application Performance

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes in two scenarios: when no errors occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario: hard but correctable faults exist in memory, which may cause an "avalanche" of errors on the affected hardware. We show how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can slow down average execution by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× under a light traffic load, and by up to an extreme 3746× under peak load. Our analysis of the memory error-reporting stack reveals architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.
