论文信息 - Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

1. Summary Recent studies estimate that server cost contributes to as much as 57% of the total cost of ownership (TCO) of a datacenter [1]. One key contributor to this high server cost is the procurement of memory devices such as DRAMs, especially for data-intensive datacenter cloud applications that need low latency (such as web search, in-memory caching, and graph traversal). Such memory devices, however, may be prone to hardware errors that occur due to unintended bit flips during device operation [40, 33, 41, 20]. To protect against such errors, traditional systems uniformly employ devices with highquality chips and error correction techniques, both of which increase device cost. At the same time, we make the observations that 1) data-intensive applications exhibit a diverse spectrum of tolerance to memory errors, and 2) traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. Our DSN-44 paper [30] is the first to 1) understand how tolerant different data-intensive applications are to memory errors and 2) design a new memory system organization that matches hardware reliability to application tolerance in order to reduce system cost. The main idea of our approach is to classify applications based on their memory error tolerance, and map applications to heterogeneous-reliability memory system designs managed cooperatively between hardware and software to reduce system cost. Our DSN-44 paper provides the following contributions: 1. A new methodology to quantify the tolerance of applications to memory errors. Our approach measures the effect of memory errors on application correctness and quantifies an application’s ability to mask or recover from memory errors. 2. A comprehensive characterization of the memory error tolerance of three data-intensive workloads: an interactive web search application [30, 39], an in-memory key‐value store [30, 3], and a graph mining framework [30, 29]. We find that there exists an order of magnitude difference in memory error tolerance across these three applications. 3. An exploration of the design space of new memory system organizations, called heterogeneous-reliability memory, which combines a heterogeneous mix of reliability techniques that leverage application error tolerance to reduce system cost. We show that our techniques can reduce server hardware cost by 4.7%, while achieving 99.90% single server availability.

[1] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[2] Sudhanva Gurumurthi,et al. Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[3] Onur Mutlu,et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4] Onur Mutlu,et al. Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery , 2017, ArXiv.

[5] Hao Wang,et al. Applying Software-based Memory Error Correction for In-Memory Key-Value Store: Case Studies on Memcached and RAMCloud , 2016, MEMSYS.

[6] D. Ielmini,et al. Physical interpretation, modeling and impact on phase change memory (PCM) reliability of resistance drift due to chalcogenide structural relaxation , 2007, 2007 IEEE International Electron Devices Meeting.

[7] Srinivas Devadas,et al. Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8] Dimitris Gizopoulos,et al. Versatile architecture-level fault injection framework for reliability evaluation: A first report , 2014, 2014 IEEE 20th International On-Line Testing Symposium (IOLTS).

[9] Zaid Al-Ars. DRAM fault analysis and test generation , 2005 .

[10] Qiang Wu,et al. A Large-Scale Study of Flash Memory Failures in the Field , 2015, SIGMETRICS 2015.

[11] Sarita V. Adve,et al. Trace-based microarchitecture-level diagnosis of permanent hardware faults , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[12] Onur Mutlu,et al. Phase change memory architecture and the quest for scalability , 2010, Commun. ACM.

[13] Moinuddin K. Qureshi,et al. Citadel , 2016, Encyclopedic Dictionary of Archaeology.

[14] Onur Mutlu,et al. Data retention in MLC NAND flash memory: Characterization, optimization, and recovery , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[15] Jie Liu,et al. Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[16] Onur Mutlu,et al. ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[17] Onur Mutlu,et al. Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[18] Jinsuk Chung,et al. CLEAN-ECC: High reliability ECC for adaptive granularity memory system , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19] Tony Tung,et al. Scaling Memcache at Facebook , 2013, NSDI.

[20] Onur Mutlu,et al. A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[21] Abhishek Verma,et al. Large-scale cluster management at Google with Borg , 2015, EuroSys.

[22] Mattan Erez,et al. Frugal ECC: efficient and versatile memory error protection through fine-grained compression , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[23] Lei Liu,et al. Memos: A full hierarchy hybrid memory management framework , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[24] Moinuddin K. Qureshi,et al. XED: Exposing On-Die Error Detection Information for Strong Memory Reliability , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[25] Onur Mutlu,et al. Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..

[26] Karthik Pattabiraman,et al. Samurai: protecting critical data in unsafe languages , 2008, Eurosys '08.

[27] Onur Mutlu,et al. Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[28] Lara Dolecek,et al. Low-Cost Memory Fault Tolerance for IoT Devices , 2017, ACM Trans. Embed. Comput. Syst..

[29] Luis Ceze,et al. Architecture support for disciplined approximate programming , 2012, ASPLOS XVII.

[30] Onur Mutlu,et al. AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[31] Mendel Rosenblum,et al. Fast crash recovery in RAMCloud , 2011, SOSP.

[32] Mattan Erez,et al. Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[33] Onur Mutlu,et al. Online design bug detection: RTL analysis, flexible mechanisms, and evaluation , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[34] Qiang Wu,et al. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[35] Chris Fallin,et al. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[36] Onur Mutlu,et al. Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[37] Xin Li,et al. A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility , 2010, USENIX Annual Technical Conference.

[38] Jun Yang,et al. Phase-Change Technology and the Future of Main Memory , 2010, IEEE Micro.

[39] Christoforos E. Kozyrakis,et al. Towards energy-proportional datacenter memory with mobile DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[40] Hongzhong Zheng,et al. Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling , 2014 .

[41] Jing Li,et al. Evaluating Row Buffer Locality in Future Non-Volatile Main Memories , 2018, ArXiv.

[42] Onur Mutlu,et al. Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[43] Karsten Schwan,et al. Data tiering in heterogeneous memory systems , 2016, EuroSys.

[44] Onur Mutlu,et al. Concurrent autonomous self-test for uncore components in system-on-chips , 2010, 2010 28th VLSI Test Symposium (VTS).

[45] Pavan Balaji,et al. Toward the efficient use of multiple explicitly managed memory subsystems , 2014, 2014 IEEE International Conference on Cluster Computing (CLUSTER).

[46] Parag Agrawal,et al. The case for RAMCloud , 2011, Commun. ACM.

[47] Xin Xu,et al. Understanding soft error propagation using Efficient vulnerability-driven fault injection , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[48] Alfredo Benso,et al. A C/C++ source-to-source compiler for dependable applications , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[49] Sriram Sankar,et al. Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.

[50] Joseph M. Hellerstein,et al. Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[51] Onur Mutlu,et al. A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM , 2017, IEEE Computer Architecture Letters.

[52] Michael C. Huang,et al. Redundant memory array architecture for efficient selective protection , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[53] Onur Mutlu,et al. Adaptive-latency DRAM: Optimizing DRAM timing for the common-case , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[54] Olaf Spinczyk,et al. Generative software-based memory error detection and correction for operating system data structures , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[55] Alan Messer,et al. Susceptibility of commodity systems and software to memory soft errors , 2004, IEEE Transactions on Computers.

[56] Onur Mutlu,et al. The RowHammer problem and other issues we may face as memory becomes denser , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[57] Onur Mutlu,et al. An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[58] Onur Mutlu,et al. Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation , 2013, ICCD.

[59] Mahmut T. Kandemir,et al. Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[60] Yixin Luo,et al. Improving the reliability of chip-off forensic analysis of NAND flash memory devices , 2017, Digit. Investig..

[61] Osman S. Unsal,et al. Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[62] O. Mutlu,et al. Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory , 2016, IEEE Journal on Selected Areas in Communications.

[63] Doe Hyun Yoon,et al. Virtualized and flexible ECC for main memory , 2010, ASPLOS XV.

[64] Onur Mutlu,et al. HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[65] Onur Mutlu,et al. Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[66] Taniya Siddiqua,et al. Analysis and Modeling of Memory Errors from Large-scale Field Data Collection , 2013 .

[67] Jie Liu,et al. No more electrical infrastructure: towards fuel cell powered data centers , 2013, HotPower '13.

[68] Sarita V. Adve,et al. Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[69] A. Kleen. mcelog : memory error handling in user space , 2010 .

[70] Vilas Sridharan,et al. A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[71] David I. August,et al. Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[72] Bianca Schroeder,et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.

[73] Kushagra Vaid,et al. Web search using mobile cores: quantifying and mitigating the price of efficiency , 2010, ISCA.

[74] Tao Li,et al. Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[75] Mor Harchol-Balter,et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[76] Thomas F. Wenisch,et al. Thermostat: Application-transparent Page Management for Two-tiered Main Memory , 2017, ASPLOS.

[77] 王晶,et al. Heterogeneous Energy-Efficient Cache Design in Warehouse Scale Computers , 2015 .

[78] Onur Mutlu,et al. The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study , 2014, SIGMETRICS '14.

[79] Onur Mutlu,et al. Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives , 2017, Proceedings of the IEEE.

[80] Kathryn S. McKinley,et al. Uncertain: a first-order type for uncertain data , 2014, ASPLOS.

[81] Zhen Fang,et al. Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[82] Mahmut T. Kandemir,et al. Performance and Energy Efficient Asymmetrically Reliable Caches for Multicore Architectures , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium Workshop.

[83] Lizy Kurian John,et al. Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[84] Vesselina K. Papazova,et al. IBM zEnterprise redundant array of independent memory subsystem , 2012, IBM J. Res. Dev..

[85] Onur Mutlu,et al. A Flexible Software-Based Framework for Online Detection of Hardware Defects , 2009, IEEE Transactions on Computers.

[86] Mattan Erez,et al. All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[87] Thomas P. Parnell,et al. Modelling of the threshold voltage distributions of sub-20nm NAND flash memory , 2014, 2014 IEEE Global Communications Conference.

[88] Kevin K. Chang,et al. Understanding and Improving the Latency of DRAM-Based Memory Systems , 2017, ArXiv.

[89] Bianca Schroeder,et al. Temperature management in data centers: why some (might) like it hot , 2012, SIGMETRICS '12.

[90] Yoongu Kim,et al. Architectural Techniques to Enhance DRAM Scaling , 2018 .

[91] Donald Yeung,et al. Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[92] Dae-Hyun Kim,et al. ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates , 2013, ISCA.

[93] Luiz André Barroso,et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[94] Rachata Ausavarungnirun,et al. Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms , 2017, SIGMETRICS.

[95] Yi-Min Wang,et al. Checkpointing and its applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[96] Onur Mutlu,et al. PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[97] Keke Gai,et al. Smart Energy-Aware Data Allocation for Heterogeneous Memory , 2016, 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[98] Onur Mutlu,et al. A Case for Effic ient Hardware/Soft ware Cooperative Management of Storage and Memory , 2013 .

[99] Robbert van Renesse,et al. An analysis of Facebook photo caching , 2013, SOSP.

[100] Subhasish Mitra,et al. ERSA: Error Resilient System Architecture for probabilistic applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[101] Carlos Guestrin,et al. Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[102] Norbert Wehn,et al. Reverse Engineering of DRAMs: Row Hammer with Crosshair , 2016, MEMSYS.

[103] Onur Mutlu,et al. Using ECC DRAM to Adaptively Increase Memory Capacity , 2017, ArXiv.

[104] Onur Mutlu,et al. Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization , 2016, SIGMETRICS.

[105] Dong Li,et al. Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[106] Onur Mutlu,et al. Understanding Reduced-Voltage Operation in Modern DRAM Devices , 2017, Proc. ACM Meas. Anal. Comput. Syst..

[107] Pedro Trancoso,et al. Odd-ECC: on-demand DRAM error correcting codes , 2017, MEMSYS.

[108] John Shalf,et al. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.

[109] Onur Mutlu,et al. Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[110] Subhasish Mitra,et al. CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns , 2008, 2008 Design, Automation and Test in Europe.

[111] Rami G. Melhem,et al. Concurrent Migration of Multiple Pages in software-managed hybrid main memory , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[112] Onur Mutlu,et al. Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[113] Ricardo Bianchini,et al. Page placement in hybrid memory systems , 2011, ICS '11.

[114] Song Liu,et al. Flikker: saving DRAM refresh-power through critical data partitioning , 2011, ASPLOS XVI.

[115] Mor Harchol-Balter,et al. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[116] Jin Sun,et al. Utility-Based Hybrid Memory Management , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).

[117] Onur Mutlu,et al. Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[118] Puneet Gupta,et al. Measuring the Impact of Memory Errors on Application Performance , 2017, IEEE Computer Architecture Letters.

[119] Onur Mutlu,et al. Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[120] Omer Subasi,et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[121] Onur Mutlu,et al. Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.

[122] Osman S. Unsal,et al. Neighbor-cell assisted error correction for MLC NAND flash memories , 2014, SIGMETRICS '14.

[123] Rolf Riesen,et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[124] Onur Mutlu,et al. Operating system scheduling for efficient online self-test in robust systems , 2009, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[125] Richard Veras,et al. RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[126] S. Phadke,et al. MLP aware heterogeneous memory system , 2011, 2011 Design, Automation & Test in Europe.

[127] Dong Tang,et al. Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[128] Vijayalakshmi Srinivasan,et al. Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[129] Michael Engel,et al. RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers , 2011, 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing.

[130] Onur Mutlu,et al. Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[131] Rachata Ausavarungnirun,et al. RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data , 2013 .

[132] Jichuan Chang,et al. BOOM: Enabling mobile memory based low-power server DIMMs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[133] Norbert Wehn,et al. Exploiting expendable process-margins in DRAMs for run-time performance optimization , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[134] Moinuddin K. Qureshi,et al. Reducing Refresh Power in Mobile Devices with Morphable ECC , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[135] Jongmoo Choi,et al. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[136] T. May,et al. Alpha-particle-induced soft errors in dynamic memories , 1979, IEEE Transactions on Electron Devices.

[137] Onur Mutlu,et al. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[138] Onur Mutlu,et al. SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[139] Rachata Ausavarungnirun,et al. Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[140] Onur Mutlu,et al. Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[141] Onur Mutlu,et al. ERRoR ANAlysIs AND RETENTIoN-AwARE ERRoR MANAgEMENT FoR NAND FlAsh MEMoRy , 2013 .

[142] Timothy J. Dell,et al. A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .

[143] Onur Mutlu,et al. Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).