Exploiting the DRAM Microarchitecture to Increase Memory-Level Parallelism

This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM, which was published in ISCA 2012, and examines the work's significance and future potential. Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of on-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures. Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure, and needs to only reinterpret some of the existing DRAM timing parameters. SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the DRAM peripheral structures, which are much less design constrained than the DRAM core. Our evaluations show that SALP-1, SALP-2 and MASA significantly improve performance for both single-core systems (7%/13%/17%) and multi-core systems (15%/16%/20%), averaged across a wide range of workloads. We also demonstrate that our mechanisms can be combined with application-aware memory request scheduling in multicore systems to further improve performance and fairness.

[1]  Kevin Kai-Wei Chang,et al.  DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2016, ACM Trans. Archit. Code Optim..

[2]  Jung Ho Ahn,et al.  Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.

[3]  Tao Lu,et al.  LAMS: A latency-aware memory scheduling policy for modern DRAM systems , 2016, 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC).

[4]  Tze Meng Low,et al.  3 D-Stacked Memory-Side Acceleration : Accelerator and System Design , 2014 .

[5]  Onur Mutlu,et al.  Distributed order scheduling and its application to multi-core dram controllers , 2008, PODC '08.

[6]  Wongyu Shin,et al.  Multiple Clone Row DRAM: A low latency and area optimized DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[7]  Onur Mutlu,et al.  The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Onur Mutlu,et al.  Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[9]  Onur Mutlu,et al.  Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10]  Onur Mutlu,et al.  Chapter Four - Simple Operations in Memory to Reduce Data Movement , 2017, Adv. Comput..

[11]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[12]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[13]  Seung-Moon Yoo,et al.  FlexRAM: Toward an advanced Intelligent Memory system , 1999, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[14]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[15]  James E. Smith,et al.  Performance Of Cached Dram Organizations In Vector Supercomputers , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[16]  Zhen Fang,et al.  Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[17]  Onur Mutlu,et al.  FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[18]  Tao Zhang,et al.  CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[19]  Onur Mutlu,et al.  Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[20]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[21]  Jing Li,et al.  A case for small row buffers in non-volatile main memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[22]  Onur Mutlu,et al.  Adaptive-latency DRAM: Optimizing DRAM timing for the common-case , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[23]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[24]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[25]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[26]  Onur Mutlu,et al.  Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization , 2016, SIGMETRICS.

[27]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[28]  S. Phadke,et al.  MLP aware heterogeneous memory system , 2011, 2011 Design, Automation & Test in Europe.

[29]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[30]  Stratos Idreos,et al.  JAFAR: Near-Data Processing for Databases , 2015, SIGMOD Conference.

[31]  Onur Mutlu,et al.  Gather-Scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[32]  Hiroyuki Kobayashi,et al.  Fast cycle RAM (FCRAM); a 20-ns random row access, pipe-lined operating DRAM , 1998, 1998 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.98CH36215).

[33]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[34]  Mahmut T. Kandemir,et al.  Exploiting Core Criticality for Enhanced GPU Performance , 2016, SIGMETRICS.

[35]  Yoonho Park,et al.  Data access optimization in a processing-in-memory system , 2015, Conf. Computing Frontiers.

[36]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[37]  Zhen Fang,et al.  The Impulse Memory Controller , 2001, IEEE Trans. Computers.

[38]  Onur Mutlu,et al.  Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[39]  Onur Mutlu,et al.  An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[40]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[41]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[42]  Xiaolang Yan,et al.  Memory Access Scheduling Based on Dynamic Multilevel Priority in Shared DRAM Systems , 2016, ACM Trans. Archit. Code Optim..

[43]  Rachata Ausavarungnirun,et al.  Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms , 2017, SIGMETRICS.

[44]  Kunle Olukotun,et al.  The hierarchical multi-bank DRAM: a high-performance architecture for memory integrated with processors , 1997, Proceedings Seventeenth Conference on Advanced Research in VLSI.

[45]  Thomas Vogelsang,et al.  Understanding the Energy Consumption of Dynamic Random Access Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[46]  Onur Mutlu,et al.  Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[47]  Onur Mutlu,et al.  Techniques for efficient processing in runahead execution engines , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[48]  Chris Fallin,et al.  Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[49]  Chris Fallin,et al.  Parallel application memory scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[50]  Hongzhong Zheng,et al.  Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling , 2014 .

[51]  Onur Mutlu,et al.  SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[52]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[53]  Rachata Ausavarungnirun,et al.  Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[54]  Mingyu Gao,et al.  HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[55]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[56]  Jun Yang,et al.  Phase-Change Technology and the Future of Main Memory , 2010, IEEE Micro.

[57]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[58]  Christoforos E. Kozyrakis,et al.  Improving System Energy Efficiency with Memory Rank Subsetting , 2012, TACO.

[59]  Sai Prashanth Muralidhara,et al.  Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[60]  Laxmi N. Bhuyan,et al.  Hardware support for bulk data movement in server platforms , 2005, 2005 International Conference on Computer Design.

[61]  Onur Mutlu,et al.  Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.

[62]  Wongyu Shin,et al.  NUAT: A non-uniform access time memory controller , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[63]  Mahmut T. Kandemir,et al.  Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[64]  Onur Mutlu,et al.  MISE: Providing performance predictability and improving fairness in shared main memory systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[65]  José F. Martínez,et al.  MORSE: Multi-objective reconfigurable self-optimizing memory scheduler , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[66]  Onur Mutlu,et al.  The reach profiler (REAPER): Enabling the mitigation of DRAM retention failures via profiling at aggressive conditions , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[67]  O Seongil,et al.  Reducing memory access latency with asymmetric DRAM bank organizations , 2013, ISCA.

[68]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[69]  Peter M. Kogge,et al.  EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[70]  Zhao Zhang,et al.  Cached DRAM for ILP Processor Memory Access Latency Reduction , 2001, IEEE Micro.

[71]  Chia-Lin Yang,et al.  Improving DRAM latency with dynamic asymmetric subarray , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[72]  Onur Mutlu,et al.  Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[73]  Jean-Loup BaerDepartment DRAM Caching , 1997 .

[74]  Onur Mutlu,et al.  Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[75]  Rachata Ausavarungnirun,et al.  Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.

[76]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[77]  Onur Mutlu,et al.  A Case for Effic ient Hardware/Soft ware Cooperative Management of Storage and Memory , 2013 .

[78]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[79]  Balaram Sinharoy,et al.  IBM POWER7 multicore server processor , 2011 .

[80]  Rajeev Balasubramonian,et al.  Managing DRAM Latency Divergence in Irregular GPGPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[81]  Richard Veras,et al.  RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[82]  Frederick A. Ware,et al.  Improving Power and Data Efficiency with Threaded Memory Modules , 2006, 2006 International Conference on Computer Design.

[83]  Onur Mutlu,et al.  Phase change memory architecture and the quest for scalability , 2010, Commun. ACM.

[84]  Rachata Ausavarungnirun,et al.  The Processing-in-Memory Paradigm: Mechanisms to Enable Adoption , 2018, Beyond-CMOS Technologies for Next Generation Computer Design.

[85]  Jie Liu,et al.  Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[86]  Onur Mutlu,et al.  Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[87]  Frederic T. Chong,et al.  Active pages: a computation model for intelligent memory , 1998, ISCA.

[88]  Eric M. Schwarz,et al.  The zEnterprise 196 System and Microprocessor , 2011, IEEE Micro.

[89]  Onur Mutlu,et al.  BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling , 2016, IEEE Transactions on Parallel and Distributed Systems.

[90]  Feng Lin,et al.  DRAM Circuit Design: Fundamental and High-Speed Topics , 2007 .

[91]  William J. Dally,et al.  Architectural Support for the Stream Execution Model on General-Purpose Processors , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[92]  Onur Mutlu,et al.  Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[93]  Björn Andersson,et al.  Bounding and reducing memory interference in COTS-based multi-core systems , 2016, Real-Time Systems.

[94]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[95]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[96]  HidakaHideto,et al.  The Cache DRAM Architecture , 1990 .

[97]  Mahmut T. Kandemir,et al.  Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[98]  O Seongil,et al.  Row-buffer decoupling: A case for low-latency DRAM microarchitecture , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[99]  Norman P. Jouppi,et al.  Rethinking DRAM design and organization for energy-constrained multi-cores , 2010, ISCA.

[100]  Onur Mutlu,et al.  The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[101]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[102]  Onur Mutlu,et al.  ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[103]  Onur Mutlu,et al.  Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[104]  Onur Mutlu,et al.  A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[105]  Onur Mutlu,et al.  DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[106]  Gershon Kedem,et al.  WCDRAM: A fully associative integrated Cached-DRAM with wide cache lines , 1997 .

[107]  Yifeng Zhu,et al.  Exploiting subarrays inside a bank to improve phase change memory performance , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[108]  Erik Nelson,et al.  A 45 nm SOI Embedded DRAM Macro for the POWER™ Processor 32 MByte On-Chip L3 Cache , 2011, IEEE Journal of Solid-State Circuits.

[109]  Björn Andersson,et al.  Bounding memory interference delay in COTS-based multi-core systems , 2014, 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[110]  Mor Harchol-Balter,et al.  ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[111]  Onur Mutlu,et al.  Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[112]  Michael Gschwind Chip multiprocessing and the cell broadband engine , 2006, CF '06.

[113]  Jongmoo Choi,et al.  Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[114]  Rachata Ausavarungnirun,et al.  RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[115]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[116]  Mahmut T. Kandemir,et al.  Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[117]  R. M. Tomasulo,et al.  An efficient algorithm for exploiting multiple arithmetic units , 1995 .

[118]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[119]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[120]  Onur Mutlu,et al.  Address-value delta (AVD) prediction: increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[121]  Vijayalakshmi Srinivasan,et al.  Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[122]  Onur Mutlu,et al.  Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..

[123]  Onur Mutlu,et al.  Runahead Execution: An Effective Alternative to Large Instruction Windows , 2003, IEEE Micro.

[124]  Sangkug Lym,et al.  ERUCA: Efficient DRAM Resource Utilization and Resource Conflict Avoidance for Memory System Parallelism , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[125]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[126]  Mattan Erez,et al.  A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC , 2012, DAC Design Automation Conference 2012.

[127]  Onur Mutlu,et al.  Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[128]  Yan Solihin,et al.  Architecture Support for Improving Bulk Memory Copying and Initialization Performance , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[129]  Erik Brunvand,et al.  Impulse: building a smarter memory controller , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[130]  Onur Mutlu,et al.  GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies , 2017, BMC Genomics.

[131]  Kevin Kai-Wei Chang,et al.  SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2015, ArXiv.

[132]  Onur Mutlu,et al.  Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[133]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[134]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[135]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[136]  Jose Renau,et al.  Programming the FlexRAM parallel intelligent memory system , 2003, PPoPP '03.

[137]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[138]  José F. Martínez,et al.  Improving memory scheduling via processor-side load criticality information , 2013, ISCA.

[139]  Onur Mutlu,et al.  The RowHammer problem and other issues we may face as memory becomes denser , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[140]  Norman P. Jouppi,et al.  Staged Reads: Mitigating the impact of DRAM writes on DRAM reads , 2012, IEEE International Symposium on High-Performance Comp Architecture.