Near-Memory Computing: Past, Present, and Future

The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out, data-intensive applications because of their limited data reuse. At the same time, advances in 3D integration technologies have made the decade-old concept of coupling compute units close to the memory, called near-memory computing (NMC), more viable. Processing right at the "home" of the data can significantly reduce the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues, along with future research directions. We also provide a glimpse of our approach to near-memory computing, which includes (i) NMC-specific, microarchitecture-independent application characterization, (ii) a compiler framework to offload NMC kernels to our target NMC platform, and (iii) an analytical model to evaluate the potential of NMC.
