SPMCloud: Towards the Single-Chip Embedded ScratchPad Memory-Based Storage Cloud

The era of cloud computing on a chip is enabled by the aggressive move towards many-core platforms and the rapid adoption of Networks-on-Chip (NoCs). As a result, there is a need for large-scale, distributed, on-chip shared memories that are reliable, low-power, and seamlessly manageable. In this work, we propose SPMCloud, a novel scratchpad-memory-based, cloud-inspired volatile storage subsystem designed to meet the needs of future-generation many-core platforms. SPMCloud combines several concepts, including: (1) a highly scalable, data-center-like memory subsystem that exploits two enterprise-network-inspired memory configurations, namely embedded Network Attached Storage (eNAS) and embedded Storage Area Network (eSAN); and (2) on-demand allocation of reliable memory space through memory virtualization and the use of embedded RAIDs (E-RAIDs).

Our experimental results on the MediaBench/CHStone benchmarks show that SPMCloud's fully distributed reliable memory subsystems achieve, on average, 48% energy savings and 70% latency reduction over state-of-the-art NoC memory reliability techniques. We then evaluate the scalability of SPMCloud and compare it with traditional SPM allocation policies. SPMCloud's dynamic allocator outperforms the best competitor by an average of 60% (eNAS) and 46% (eSAN) when the platform runs at 250 MHz, and by an average of 80% (eNAS) and 40% (eSAN) when running at 1 GHz. Moreover, SPMCloud achieves an average of 83% energy savings across all configurations (numbers of cores) with respect to the best competitors at both 250 MHz and 1 GHz. We then study the SPM hit ratio across the various allocation policies discussed in this article and show that SPMCloud's priority-driven dynamic allocation policy achieves an average SPM hit ratio of 93.5%, 0.6% higher than the closest competing allocation policy. We also show that eNAS and eSAN achieve average reductions in execution time of 67.9% and 29%, respectively, over the best competitor.

Similarly, eNAS and eSAN achieve average energy savings of 82.7% and 82.3%, respectively, over the best competitor. Furthermore, we evaluate the scalability of SPMCloud and its performance/energy efficiency when supporting some of the heavier E-RAID levels, and show that the eNAS/eSAN configurations with SECDED achieve average reductions in execution time of 51.5% and 34.9%, respectively, over the best competitor with SECDED. Likewise, the eNAS/eSAN configurations with E-RAID Level 1 + SECDED achieve average energy savings of 82.3% and 75.6%, respectively, over the best competitor.
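The abstract does not detail how SPMCloud's priority-driven dynamic allocation policy works, so the following is only a hypothetical sketch of the general idea, not the paper's algorithm. It assumes that each request carries a priority and a profiled access count, serves requests in priority order across distributed SPM banks, spills whatever does not fit to off-chip memory, and defines the SPM hit ratio as the fraction of accesses served from SPM. All class and function names (`SPMBank`, `Request`, `allocate`, `spm_hit_ratio`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SPMBank:
    """One distributed scratchpad bank (hypothetical model)."""
    bank_id: int
    capacity: int  # bytes
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


@dataclass
class Request:
    """An application's request for SPM space (hypothetical fields)."""
    app: str
    size: int      # bytes requested
    priority: int  # larger value = higher priority (assumption)
    accesses: int  # profiled accesses to this data (assumption)


def allocate(requests, banks):
    """Serve requests in descending priority order, placing each on the bank
    with the most free space that still fits it. Requests that fit nowhere
    are mapped to off-chip memory, recorded as None."""
    placement = {}
    for req in sorted(requests, key=lambda r: r.priority, reverse=True):
        fits = [b for b in banks if b.free() >= req.size]
        if fits:
            bank = max(fits, key=lambda b: b.free())  # ties go to the first bank
            bank.used += req.size
            placement[req.app] = bank.bank_id
        else:
            placement[req.app] = None  # spills to off-chip memory
    return placement


def spm_hit_ratio(requests, placement):
    """Fraction of accesses that target data placed in some SPM bank."""
    total = sum(r.accesses for r in requests)
    on_chip = sum(r.accesses for r in requests if placement[r.app] is not None)
    return on_chip / total if total else 0.0


banks = [SPMBank(0, 4096), SPMBank(1, 4096)]
requests = [
    Request("A", 3000, priority=3, accesses=900),
    Request("B", 3000, priority=2, accesses=500),
    Request("C", 2000, priority=1, accesses=100),
]
placement = allocate(requests, banks)
print(placement)                                     # {'A': 0, 'B': 1, 'C': None}
print(round(spm_hit_ratio(requests, placement), 3))  # 0.933
```

Serving requests in priority order means that, once capacity runs out, the lowest-priority data is the first to spill off-chip; in this toy example that keeps the most frequently accessed data on-chip and the hit ratio high.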
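As a rough illustration of what an "E-RAID Level 1 + SECDED" configuration protects against, the sketch below assumes that E-RAID Level 1 behaves like classic RAID-1 mirroring of a block across two SPM banks and that SECDED is realized with an extended Hamming code. For brevity it uses an (8,4) code on 4-bit values, whereas memory words are commonly protected with a (72,64) code. The function names are hypothetical, and this is not the paper's implementation.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an extended Hamming (8,4) SECDED codeword.
    Index 0 holds the overall parity bit; indices 1, 2, 4 are Hamming parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]  # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]  # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]  # covers positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2      # overall parity enables double-error detection
    return c


def secded_decode(codeword):
    """Return (nibble, status); nibble is None if a double error is detected."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1  # single error; syndrome 0 means c[0] itself flipped
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity but nonzero syndrome
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3), status


def raid1_read(primary, mirror):
    """Hypothetical mirrored (RAID-1-style) read: fall back to the mirror copy
    when the primary copy has an uncorrectable error."""
    for name, copy in (("primary", primary), ("mirror", mirror)):
        value, status = secded_decode(copy)
        if value is not None:
            return value, f"{name}-{status}"
    raise RuntimeError("both copies uncorrectable")


primary = secded_encode(0b1011)
mirror = secded_encode(0b1011)
primary[2] ^= 1  # double-bit upset in the primary bank
primary[5] ^= 1
mirror[6] ^= 1   # single-bit upset in the mirror bank
print(raid1_read(primary, mirror))  # (11, 'mirror-corrected')
```

A single-bit upset is corrected in place by SECDED alone; mirroring is what lets a read survive a double-bit error in one copy, at the cost of doubling the storage and write traffic for protected data.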
