Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems

In most modern systems, the memory subsystem is managed and accessed at multiple different granularities at various resources. We observe that such multi-granularity management results in significant inefficiency in the memory subsystem. Specifically, we observe that 1) page-granularity virtual memory unnecessarily triggers large memory operations, and 2) existing cache-line granularity memory interface is inefficient for performing bulk data operations and operations that exhibit poor spatial locality. To address these problems, we present a series of techniques in this thesis. First, we propose page overlays, a framework augments the existing virtual memory framework with the ability to track a new version of a subset of cache lines within each virtual page. We show that this extension is powerful by demonstrating its benefits on a number of applications. Second, we show that DRAM can be used to perform more complex operations than just store data. We propose RowClone, a mechanism to perform bulk data copy and initialization completely inside DRAM, and Buddy RAM, a mechanism to perform bulk bitwise operations using DRAM. Both these techniques achieve an order-of-magnitude improvement in the efficiency of the respective operations. Third, we propose Gather-Scatter DRAM, a technique that exploits DRAM organization to effectively gather/scatter values with a power-of-2 strided access patterns. For these access patterns, GS-DRAM achieves near-ideal bandwidth and cache utilization, without increasing the latency of fetching data from memory. Finally, we propose the Dirty-Block Index, a new way of tracking dirty blocks. In addition to improving the efficiency of bulk data coherence, DBI has several applications including high-performance memory scheduling, efficient cache lookup bypassing, and enabling heterogeneous ECC.

[1]  Laszlo A. Belady,et al.  A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..

[2]  John S. Liptay,et al.  Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..

[3]  Peter J. Denning,et al.  Virtual memory , 1970, CSUR.

[4]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[5]  Leonidas J. Guibas,et al.  A dichromatic framework for balanced trees , 1978, 19th Annual Symposium on Foundations of Computer Science (sfcs 1978).

[6]  Stanley C. Eisenstat,et al.  Yale sparse matrix package I: The symmetric codes , 1982 .

[7]  Abraham Silberschatz,et al.  Operating System Concepts , 1983 .

[8]  William Jalby,et al.  XOR-Schemes: A Flexible Data Organization in Parallel Memories , 1985, ICPP.

[9]  Tilak Agerwala,et al.  Proceedings of the 42nd Annual International Symposium on Computer Architecture , 1985, ISCA 1985.

[10]  Alan Jay Smith,et al.  A class of compatible cache consistency protocols and their support by the IEEE futurebus , 1986, ISCA '86.

[11]  David L. Black,et al.  Translation lookaside buffer consistency: a software approach , 1989, ASPLOS III.

[12]  Mark E. Staknis,et al.  Sheaved memory: architectural support for state saving and restoration in pages systems , 1989, ASPLOS III.

[13]  Patricia J. Teller Translation-lookaside buffer consistency , 1990, Computer.

[14]  Hideto Hidaka,et al.  The cache DRAM architecture: a DRAM with an on-chip cache memory , 1990, IEEE Micro.

[15]  Janak H. Patel,et al.  Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[16]  B. Ramakrishna Rau,et al.  Pseudo-randomly interleaved memory , 1991, ISCA '91.

[17]  Eduard Ayguadé,et al.  Conflict-free access of vectors with power-of-two strides , 1992, ICS '92.

[18]  J.W.C. Fu,et al.  Stride Directed Prefetching In Scalar Processors , 1992, [1992] Proceedings the 25th Annual International Symposium on Microarchitecture MICRO 25.

[19]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[20]  R. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[21]  André Seznec,et al.  Decoupled sectored caches: conciliating low tag implementation cost , 1994, ISCA '94.

[22]  Peter M. Kogge,et al.  EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[23]  Gurindar S. Sohi,et al.  Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[24]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[25]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[26]  Yi-Min Wang,et al.  Checkpointing and its applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[27]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[28]  Michael Burrows,et al.  Eraser: a dynamic data race detector for multi-threaded programs , 1997, TOCS.

[29]  Michel Cekleov,et al.  Virtual-address caches. Part 1: problems and solutions in uniprocessors , 1997, IEEE Micro.

[30]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[31]  James E. Smith,et al.  The predictability of data values , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[32]  Kazuaki Murakami,et al.  Optimizing the DRAM refresh count for merged DRAM/logic LSIs , 1998, Proceedings. 1998 International Symposium on Low Power Electronics and Design (IEEE Cat. No.98TH8379).

[33]  Yannis E. Ioannidis,et al.  Bitmap index design and evaluation , 1998, SIGMOD '98.

[34]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[35]  M. Oskin,et al.  Active Pages: a computation model for intelligent memory , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[36]  Garth A. Gibson,et al.  Automatic I/O hint generation through speculative execution , 1999, OSDI '99.

[37]  Erik Brunvand,et al.  Impulse: building a smarter memory controller , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[38]  Seung-Moon Yoo,et al.  FlexRAM: toward an advanced intelligent memory system , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).

[39]  Duncan G. Elliott,et al.  Computational RAM: Implementing Processors in Memory , 1999, IEEE Des. Test Comput..

[40]  Eun-Soo Kim,et al.  Optical image encryption based on XOR operations , 1999 .

[41]  Antonia Zhai,et al.  A scalable approach to thread-level speculation , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[42]  Daniel Pierre Bovet,et al.  Understanding the Linux Kernel , 2000 .

[43]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[44]  Manoj Franklin,et al.  Balancing thoughput and fairness in SMT processors , 2001, 2001 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS..

[45]  Zhen Fang,et al.  The Impulse Memory Controller , 2001, IEEE Trans. Computers.

[46]  Michael E. Wazlowski,et al.  IBM Memory Expansion Technology (MXT) , 2001, IBM J. Res. Dev..

[47]  Krste Asanovic,et al.  Mondrian memory protection , 2002, ASPLOS X.

[48]  Arie Shoshani,et al.  Compressing bitmap indexes for faster search operations , 2002, Proceedings 14th International Conference on Scientific and Statistical Database Management.

[49]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[50]  Carl A. Waldspurger,et al.  Memory resource management in VMware ESX server , 2002, OSDI '02.

[51]  Josep Torrellas,et al.  ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors , 2002, ISCA.

[52]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[53]  Henry S. Warren,et al.  Hacker's Delight , 2002 .

[54]  Jose Renau,et al.  Programming the FlexRAM parallel intelligent memory system , 2003, PPoPP '03.

[55]  Edwin D. Reilly Memory-mapped I/O , 2003 .

[56]  Marios C. Papaefthymiou,et al.  Block-based multiperiod dynamic memory design for low data-retention power , 2003, IEEE Trans. Very Large Scale Integr. Syst..

[57]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[58]  Gernot Heiser,et al.  Legba: Fast Hardware Support for Fine-Grained Protection , 2003, Asia-Pacific Computer Systems Architecture Conference.

[59]  Glenn Reinman,et al.  Just say no: benefits of early cache miss determination , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[60]  Srikanth Kandula,et al.  Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging , 2004, USENIX Annual Technical Conference, General Track.

[61]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[62]  Wei Liu,et al.  iWatcher: efficient architectural support for software debugging , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[63]  Bradley C. Kuszmaul,et al.  Unbounded Transactional Memory , 2005, HPCA.

[64]  Henk D. L. Hollmann,et al.  XOR-based Visual Cryptography Schemes , 2005, Des. Codes Cryptogr..

[65]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[66]  Kim R. Rasmussen,et al.  Efficient q-Gram Filters for Finding All-Matches Over a Given Length , 2005 .

[67]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2005, IEEE Micro.

[68]  M. Ekman,et al.  A robust main-memory compression scheme , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[69]  Tal Garfinkel,et al.  Shredding Your Garbage: Reducing Data Lifetime Through Secure Deallocation , 2005, USENIX Security Symposium.

[70]  No License,et al.  Intel ® 64 and IA-32 Architectures Software Developer ’ s Manual Volume 3 A : System Programming Guide , Part 1 , 2006 .

[71]  Edmund B. Nightingale,et al.  Speculative execution in a distributed file system , 2006, TOCS.

[72]  Amit Singh,et al.  Mac OS X Internals: A Systems Approach , 2006 .

[73]  Onur Mutlu,et al.  Efficient runahead execution processors , 2006 .

[74]  Hsien-Hsin S. Lee,et al.  An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[75]  Frederick A. Ware,et al.  Improving Power and Data Efficiency with Threaded Memory Modules , 2006, 2006 International Conference on Computer Design.

[76]  Laxmi N. Bhuyan,et al.  Hardware Support for Accelerating Data Movement in Server Platform , 2007, IEEE Transactions on Computers.

[77]  Onur Mutlu,et al.  Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[78]  S.A. Manavski,et al.  CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography , 2007, 2007 IEEE International Conference on Signal Processing and Communications.

[79]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[80]  Shouhuai Xu,et al.  Protecting Cryptographic Keys from Memory Disclosure Attacks , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[81]  Bingsheng He,et al.  Efficient gather and scatter operations on graphics processors , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[82]  Kesheng Wu,et al.  Bitmap Index Design Choices and Their Performance Implications , 2007, 11th International Database Engineering and Applications Symposium (IDEAS 2007).

[83]  Feng Lin,et al.  DRAM Circuit Design: Fundamental and High-Speed Topics , 2007 .

[84]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[85]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[86]  Nicholas Nethercote,et al.  How to shadow every byte of memory used by a program , 2007, VEE '07.

[87]  Hsien-Hsin S. Lee,et al.  Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[88]  Onur Mutlu,et al.  Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[89]  Onur Mutlu,et al.  Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[90]  George Varghese,et al.  Difference engine , 2010, OSDI.

[91]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[92]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[93]  Onur Mutlu,et al.  Online design bug detection: RTL analysis, flexible mechanisms, and evaluation , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[94]  Guru Venkataramani,et al.  FlexiTaint: A programmable accelerator for dynamic taint propagation , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[95]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[96]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[97]  Onur Mutlu,et al.  Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[98]  John Bent,et al.  PLFS: a checkpoint filesystem for parallel applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[99]  Rajiv Gupta,et al.  Architectural support for shadow memory in multiprocessors , 2009, VEE '09.

[100]  Jung Ho Ahn,et al.  Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.

[101]  Christoforos E. Kozyrakis,et al.  Future scaling of processor-memory interfaces , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[102]  Ariel J. Feldman,et al.  Lest we remember: cold-boot attacks on encryption keys , 2008, CACM.

[103]  Jun Yang,et al.  A durable and energy efficient main memory using phase change memory technology , 2009, ISCA '09.

[104]  Lizy Kurian John,et al.  ESKIMO - energy savings using semantic knowledge of inconsequential memory occupancy for DRAM subsystem , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[105]  Yan Solihin,et al.  Architecture Support for Improving Bulk Memory Copying and Initialization Performance , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[106]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[107]  Michael Brudno,et al.  SHRiMP: Accurate Mapping of Short Color-space Reads , 2009, PLoS Comput. Biol..

[108]  K. Reinert,et al.  RazerS--fast read mapping with sensitivity control. , 2009, Genome research.

[109]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[110]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[111]  Eyal de Lara,et al.  SnowFlock: rapid virtual machine cloning for cloud computing , 2009, EuroSys '09.

[112]  Doe Hyun Yoon,et al.  Flexible cache error protection using an ECC FIFO , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[113]  Daniel J. Sorin,et al.  UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[114]  Alexander Zeier,et al.  HYRISE - A Main Memory Hybrid Storage Engine , 2010, Proc. VLDB Endow..

[115]  Onur Mutlu,et al.  DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[116]  Mor Harchol-Balter,et al.  ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[117]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[118]  H.-S. Philip Wong,et al.  Phase Change Memory , 2010, Proceedings of the IEEE.

[119]  Tony M. Brewer,et al.  Instruction Set Innovations for the Convey HC-1 Computer , 2010, IEEE Micro.

[120]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[121]  Norman P. Jouppi,et al.  Rethinking DRAM design and organization for energy-constrained multi-cores , 2010, ISCA.

[122]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[123]  Lizy Kurian John,et al.  Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[124]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[125]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[126]  Qin Zhao,et al.  Efficient memory shadowing for 64-bit architectures , 2010, ISMM '10.

[127]  Qin Zhao,et al.  Umbra: efficient and scalable memory shadowing , 2010, CGO '10.

[128]  Jason Flinn,et al.  Operating system support for application-specific speculation , 2011, EuroSys '11.

[129]  David I. August,et al.  Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.

[130]  Bradford M. Beckmann,et al.  The gem5 simulator , 2011, CARN.

[131]  Doe Hyun Yoon,et al.  Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[132]  Kees G. W. Goossens,et al.  Improved Power Modeling of DDR SDRAMs , 2011, 2011 14th Euromicro Conference on Digital System Design.

[133]  Avi Mendelson,et al.  DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[134]  Masashi Horiguchi,et al.  Nanoscale Memory Repair , 2011, Integrated Circuits and Systems.

[135]  Xi Yang,et al.  Why nothing matters: the impact of zeroing , 2011, OOPSLA '11.

[136]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[137]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[138]  Vitaly Shmatikov,et al.  Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels , 2012, OSDI.

[139]  Véronique Martin,et al.  Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis , 2012, J. Comput. Biol..

[140]  Richard Veras,et al.  RAIDR: Retention-aware intelligent DRAM refresh , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[141]  Onur Mutlu,et al.  Base-delta-immediate compression: Practical data compression for on-chip caches , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[142]  Doe Hyun Yoon,et al.  The dynamic granularity memory system , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[143]  Onur Mutlu,et al.  The evicted-address filter: A unified mechanism to address both cache pollution and thrashing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[144]  Sandhya Dwarkadas,et al.  Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[145]  J. Jeddeloh,et al.  Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[146]  Jacques-Olivier Klein,et al.  Failure and reliability analysis of STT-MRAM , 2012, Microelectron. Reliab..

[147]  Onur Mutlu,et al.  A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[148]  David R. Cheriton,et al.  HICAMP: architectural support for efficient concurrency-safe shared structured data access , 2012, ASPLOS XVII.

[149]  Madhu Mutyam,et al.  SkipCache: Miss-rate aware cache management , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[150]  Onur Mutlu,et al.  MISE: Providing performance predictability and improving fairness in shared main memory systems , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[151]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[152]  Bran Selic,et al.  A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems , 2013, The Journal of Supercomputing.

[153]  Mohamad Towfik Krounbi,et al.  Basic principles of STT-MRAM cell operation in memory arrays , 2013 .

[154]  Onur Mutlu,et al.  An experimental study of data retention behavior in modern DRAM devices: implications for retention time profiling mechanisms , 2013, ISCA.

[155]  Onur Mutlu,et al.  Accelerating read mapping with FastHASH , 2013, BMC Genomics.

[156]  Onur Mutlu,et al.  Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[157]  O Seongil,et al.  Reducing memory access latency with asymmetric DRAM bank organizations , 2013, ISCA.

[158]  José F. Martínez,et al.  Improving memory scheduling via processor-side load criticality information , 2013, ISCA.

[159]  Rachata Ausavarungnirun,et al.  RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[160]  Hui Zhao,et al.  A Scaling Roadmap and Performance Evaluation of In-Plane and Perpendicular MTJ Based STT-MRAMs for High-Density Cache Memory , 2013, IEEE Journal of Solid-State Circuits.

[161]  Franz Franchetti,et al.  Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware , 2013, 2013 IEEE High Performance Extreme Computing Conference (HPEC).

[162]  Mahmut T. Kandemir,et al.  Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[163]  Onur Mutlu,et al.  Linearly compressed pages: A low-complexity, low-latency main memory compression framework , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[164]  Mario Badr,et al.  Load Value Approximation , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[165]  Onur Mutlu,et al.  Rollback-free value prediction with approximate loads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[166]  Onur Mutlu,et al.  The efficacy of error mitigation techniques for DRAM retention failures: a comparative experimental study , 2014, SIGMETRICS '14.

[167]  Onur Mutlu,et al.  Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks , 2014, ACM Trans. Archit. Code Optim..

[168]  Onur Mutlu,et al.  The Dirty-Block Index , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[169]  Onur Mutlu,et al.  Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[170]  Jun Yang,et al.  Mitigating Write Disturbance in Super-Dense Phase Change Memories , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[171]  Onur Mutlu,et al.  The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[172]  Ran Ginosar,et al.  GP-SIMD Processing-in-Memory , 2015, ACM Trans. Archit. Code Optim..

[173]  Onur Mutlu,et al.  FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[174]  Rami G. Melhem,et al.  Refresh Now and Then , 2014, IEEE Transactions on Computers.

[175]  Jimmy J. Lin,et al.  Summingbird: A Framework for Integrating Batch and Online MapReduce Computations , 2014, Proc. VLDB Endow..

[176]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[177]  Tao Zhang,et al.  Half-DRAM: A high-bandwidth and low-power DRAM architecture from the rethinking of fine-grained activation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[178]  3 D-Stacked Memory-Side Acceleration : Accelerator and System Design , 2014 .

[179]  Nikita Shamgunov The MemSQL In-Memory Database System , 2014, IMDM@VLDB.

[180]  Onur Mutlu,et al.  Adaptive-latency DRAM: Optimizing DRAM timing for the common-case , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[181]  Yoonho Park,et al.  Data access optimization in a processing-in-memory system , 2015, Conf. Computing Frontiers.

[182]  Mahmut T. Kandemir,et al.  Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[183]  Onur Mutlu,et al.  The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[184]  Sudhakar Yalamanchili,et al.  Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore , 2015, MEMSYS.

[185]  Onur Mutlu,et al.  AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[186]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[187]  Wongyu Shin,et al.  Multiple Clone Row DRAM: A low latency and area optimized DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[188]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[189]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[190]  Chia-Lin Yang,et al.  Improving DRAM latency with dynamic asymmetric subarray , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[191]  Jongmoo Choi,et al.  Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[192]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[193]  Stratos Idreos,et al.  JAFAR: Near-Data Processing for Databases , 2015, SIGMOD Conference.

[194]  Onur Mutlu,et al.  Exploiting compressed block size as an indicator of future reuse , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[195]  Onur Mutlu,et al.  Gather-Scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[196]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[197]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[198]  Onur Mutlu,et al.  Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[199]  Onur Mutlu,et al.  ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[200]  Onur Mutlu,et al.  BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling , 2016, IEEE Transactions on Parallel and Distributed Systems.

[201]  Onur Mutlu,et al.  A case for toggle-aware compression for GPU systems , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[202]  Kevin Kai-Wei Chang,et al.  DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2016, ACM Trans. Archit. Code Optim..

[203]  Onur Mutlu,et al.  Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..

[204]  Onur Mutlu,et al.  Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, ISCA.

[205]  Mingyu Gao,et al.  HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[206]  Laura Hoch Understanding The Linux Virtual Memory Manager , 2016 .

[207]  Onur Mutlu,et al.  Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[208]  Onur Mutlu,et al.  PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[209]  Onur Mutlu,et al.  Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization , 2016, SIGMETRICS.