Performance Implications of Processing-in-Memory Designs on Data-Intensive Applications

The popularity of data-intensive applications and recent hardware developments have driven the re-emergence of processing-in-memory (PIM), which was first explored several decades ago. To introduce PIM into a system, we must answer a fundamental question: what computation logic should be included in PIM? In terms of computational complexity, PIM logic can be either relatively simple and fixed-functional, or fully programmable. The choice between fixed-functional PIM and programmable PIM has a direct impact on performance. In this paper, we explore the performance implications of fixed-functional PIM and programmable PIM on three data-intensive benchmarks, including a real data-intensive application. Our results show that PIM yields speedups of 2.09x to 91.4x over the no-PIM cases. However, fixed-functional PIM and programmable PIM perform differently across applications, with a performance difference of up to 90%. Neither fixed-functional PIM nor programmable PIM performs optimally in all cases. The choice of PIM must therefore be based on the characteristics of the workload and of the PIM (e.g., instruction-level parallelism), and on the PIM overhead (e.g., PIM initialization and synchronization overhead).
