Software Controlled Memories for Scalable Many-Core Architectures

Technology scaling, along with the ever-evolving demand for media-rich software stacks, has motivated the need for many-core platforms. With the increase in compute power and its inherent demand for high memory bandwidth comes the need for vast amounts of on-chip memory space. Thus, designers must carefully provision the memory real estate to meet their applications' needs. It has been shown in the embedded systems domain that both software-controlled memories (e.g., scratchpad memories) and hardware-controlled memories (e.g., caches) have their pros and cons: some application domains, such as multimedia, fit the software-controlled memory model very well, while other domains, such as databases, work well with caches. As a result, efficient memory management is critical, as it has a great impact on the system's power consumption and throughput. Traditional memory hierarchies primarily consist of SRAM-based on-chip caches; however, with the emergence of non-volatile memories (NVMs) and mixed-criticality systems, on-chip memories will be heterogeneous, not only in type (cache vs. scratchpad) but also in technology (e.g., SRAM vs. NVM). This paper surveys the state of the art in memory subsystems for many-core platforms and presents strategies for efficiently managing software-controlled memories in the many-core domain, while addressing the various challenges designers face in deploying such memory subsystems (e.g., sharing the memory resources and accounting for variations in the subsystem).
