Near Data Acceleration with Concurrent Host Access

Near-data accelerators (NDAs) that are integrated with the main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and NDAs in a way that permits both regular memory access by some applications and accelerating others with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA. We identify and solve new challenges in this context: mitigating row-locality interference from host to NDAs, reducing read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports the locality required for NDAs and sophisticated address interleaving for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory modules. We show that our mechanisms enable effective and efficient concurrent access using a set of microbenchmarks, then demonstrate the potential of the system for the important stochastic variance-reduced gradient (SVRG) algorithm.

[1]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[2]  Jung Ho Ahn,et al.  The Design Space of Data-Parallel Memory Systems , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[3]  Onur Mutlu,et al.  LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory , 2017, IEEE Computer Architecture Letters.

[4]  Christoforos E. Kozyrakis,et al.  TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[5]  Guang R. Gao,et al.  Processing In Memory: Chips to Petaflops , 1997, ISCA 1997.

[6]  Rachata Ausavarungnirun,et al.  CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[7]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[8]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[9]  Seung-Moon Yoo,et al.  FlexRAM: Toward an advanced Intelligent Memory system , 1999, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[10]  Jung Ho Ahn,et al.  Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Onur Mutlu,et al.  Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[12]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[13]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[14]  Yoonho Park,et al.  Data access optimization in a processing-in-memory system , 2015, Conf. Computing Frontiers.

[15]  Christoforos E. Kozyrakis,et al.  GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[16]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[17]  Yiran Chen,et al.  ReGAN: A pipelined ReRAM-based accelerator for generative adversarial networks , 2018, 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC).

[18]  Youngjin Kwon,et al.  Coordinated and Efficient Huge Page Management with Ingens , 2016, OSDI.

[19]  Tao Zhang,et al.  PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[20]  Saibal Mukhopadhyay,et al.  ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration , 2018, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[21]  Mahmut T. Kandemir,et al.  Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[22]  Gwangsun Kim,et al.  Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[23]  Yingwei Luo,et al.  Get Out of the Valley: Power-Efficient Address Mapping for GPUs , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[24]  Andrew B. Kahng,et al.  CACTI-IO: CACTI with off-chip power-area-timing models , 2012, 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[25]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[26]  Patrick J. Meaney,et al.  The IBM z13 memory subsystem for big data , 2015, IBM J. Res. Dev..

[27]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[28]  Jung Ho Ahn,et al.  NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[29]  Ramyad Hadidi,et al.  GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[30]  Frederic T. Chong,et al.  Active pages: a computation model for intelligent memory , 1998, ISCA.

[31]  Timothy J. Dell,et al.  A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .

[32]  Babak Falsafi,et al.  The mondrian data engine , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[33]  Dong Li,et al.  Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34]  Nam Sung Kim,et al.  NetDIMM: Low-Latency Near-Memory Network Interface Architecture , 2019, MICRO.

[35]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[36]  Onur Mutlu,et al.  Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[37]  Stefan Mangard,et al.  DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks , 2015, USENIX Security Symposium.

[38]  Cong Xu,et al.  Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[39]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[40]  Harold S. Stone,et al.  A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.

[41]  Miao Hu,et al.  ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[42]  Mel Gorman,et al.  Understanding the Linux Virtual Memory Manager , 2004 .

[43]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[44]  Minsoo Rhu,et al.  TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning , 2019, MICRO.

[45]  Huazhong Yang,et al.  Energy-efficient SQL query exploiting RRAM-based process-in-memory structure , 2017, 2017 IEEE 6th Non-Volatile Memory Systems and Applications Symposium (NVMSA).

[46]  Yiran Chen,et al.  GraphR: Accelerating Graph Processing Using ReRAM , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[47]  Reena Panda,et al.  Wait of a Decade: Did SPEC CPU 2017 Broaden the Performance Horizon? , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[48]  Jinjun Xiong,et al.  Application-Transparent Near-Memory Processing Architecture with Memory Channel Network , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[49]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[50]  Fabrice Devaux,et al.  The true Processing In Memory accelerator , 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).

[51]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[52]  Dam Sunwoo,et al.  Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[53]  Bahar Asgari,et al.  Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube , 2017, 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[54]  Onur Mutlu,et al.  Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[55]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[56]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.

[57]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[58]  Franz Franchetti,et al.  HAMLeT Architecture for Parallel Data Reorganization in Memory , 2016, IEEE Micro.

[59]  Xiaobing Feng,et al.  Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors , 2010, NPC.

[60]  Rachata Ausavarungnirun,et al.  Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.

[61]  Sudhakar Yalamanchili,et al.  Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[62]  Brad Calder,et al.  SimPoint 3.0: Faster and More Flexible Program Phase Analysis , 2005, J. Instr. Level Parallelism.

[63]  Maya Gokhale,et al.  Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.

[64]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[65]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[66]  Jung Ho Ahn,et al.  Accelerating linked-list traversal through near-data processing , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[67]  Stefan Mangard,et al.  Reverse Engineering Intel DRAM Addressing and Exploitation , 2015, ArXiv.

[68]  Yuan Xie,et al.  DRISA: A DRAM-based Reconfigurable In-Situ Accelerator , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[69]  John Langford,et al.  Slow Learners are Fast , 2009, NIPS.

[70]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[71]  Lei Liu,et al.  A software memory partition approach for eliminating bank-level interference in multicore systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[72]  Hyesoon Kim,et al.  BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[73]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[74]  Tsuyoshi Murata,et al.  {m , 1934, ACML.

[75]  Maurice Herlihy,et al.  Concurrent Data Structures for Near-Memory Computing , 2017, SPAA.

[76]  Peter M. Kogge,et al.  EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.

[77]  Rodolfo Pellizzoni,et al.  PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms , 2014, 2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS).

[78]  Tejas Karkhanis,et al.  Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..

[79]  Onur Mutlu,et al.  Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[80]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[81]  Yiran Chen,et al.  PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).