Paper Information - Measuring the Impact of Memory Errors on Application Performance

Measuring the Impact of Memory Errors on Application Performance

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes when errors do not occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, namely, situations when hard but correctable faults exist in memory; these may cause an “avalanche” of errors to occur on affected hardware. We expose how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can slow down average execution time by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× for a light traffic load, and up to an extreme 3746× under peak load. Our analysis of the memory error-reporting stack reveals architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.
