Reliability Enhancement of SoCs Based on Dynamic Memory Access Profiling in Conjunction with PVT Monitoring

Continued technology scaling and the larger die sizes of multi-processor Systems-on-Chip (SoCs) have increased the error rates of on-chip memories. Higher system speeds for performance, aggressive voltage scaling for power reduction, and intra-die process variation have exacerbated this unreliability. This paper discusses a memory management method that enhances the reliability of SoCs. The method provides a mechanism for automatically moving the contents of a less reliable memory to a more reliable one. The proposed module, RAIMM (Reliability Aware Intelligent Memory Management), is an architectural framework that dynamically computes the reliability of the on-chip memories and offers the application a more reliable alternative in case of a memory failure. Silicon characterization data is used in conjunction with on-chip process/voltage/temperature (PVT) sensors to estimate the reliability status of each memory. RAIMM ranks the available memories based on the operating conditions, the silicon characterization data, and dynamic access profiling data; this ranking can be used to predict memory failure and warn the application in advance. A hardware-programmed Direct Memory Access (DMA) engine keeps the overall application running efficiently, with low software overhead for maintaining the memory configuration and contents.
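
The ranking idea can be illustrated with a minimal sketch. This is not the RAIMM implementation, which the abstract does not specify. The scoring formula, its weights, the nominal operating point, and every name below (MemoryStatus, risk_score, rank_memories, plan_migration) are assumptions made purely for illustration. The sketch combines a characterization-derived base error rate with voltage and temperature sensor readings and an access-rate figure, ranks the memories from most to least reliable, and lists which memories would be migrated to the most reliable one.

```python
from dataclasses import dataclass

# Hypothetical nominal operating point; real values would come from
# silicon characterization of the specific memories.
NOMINAL_VOLTAGE_V = 1.0
NOMINAL_TEMPERATURE_C = 25.0


@dataclass
class MemoryStatus:
    name: str
    base_error_rate: float   # assumed: characterization data at nominal conditions
    voltage_v: float         # assumed: on-chip voltage sensor reading
    temperature_c: float     # assumed: on-chip temperature sensor reading
    accesses_per_ms: float   # assumed: dynamic access profiling counter


def risk_score(mem: MemoryStatus) -> float:
    """Illustrative risk estimate; higher means less reliable.

    The weights (10.0, 0.02, 1000.0) are arbitrary placeholders,
    not values taken from the paper.
    """
    undervolt = max(0.0, NOMINAL_VOLTAGE_V - mem.voltage_v)
    overheat = max(0.0, mem.temperature_c - NOMINAL_TEMPERATURE_C)
    stress = 1.0 + 10.0 * undervolt + 0.02 * overheat
    exposure = 1.0 + mem.accesses_per_ms / 1000.0
    return mem.base_error_rate * stress * exposure


def rank_memories(mems):
    """Return memories ordered from most reliable to least reliable."""
    return sorted(mems, key=risk_score)


def plan_migration(mems, threshold):
    """List (source, destination) pairs for memories whose risk exceeds
    the threshold; the destination is always the most reliable memory."""
    ranked = rank_memories(mems)
    safest = ranked[0]
    return [(m.name, safest.name) for m in ranked[1:] if risk_score(m) > threshold]


# Example readings (made-up numbers).
memories = [
    MemoryStatus("sram1", 1e-6, voltage_v=0.8, temperature_c=85.0, accesses_per_ms=1000.0),
    MemoryStatus("sram0", 1e-6, voltage_v=1.0, temperature_c=25.0, accesses_per_ms=0.0),
]

if __name__ == "__main__":
    for mem in rank_memories(memories):
        print(mem.name, risk_score(mem))
    print(plan_migration(memories, threshold=5e-6))
```

In a real system the migration list would be handed to the DMA engine rather than printed, and the score would be recomputed whenever the sensors or access counters change.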
