Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories

Processing-in-memory (PIM) provides high bandwidth, massive parallelism, and high energy efficiency by implementing computations in main memory, therefore eliminating the overhead of data movement between CPU and memory. While most of the recent work focused on PIM in DRAM memory with 3D die-stacking technology, we propose to leverage the unique features of emerging non-volatile memory (NVM), such as resistance-based storage and current sensing, to enable efficient PIM design in NVM. We propose Pinatubo1, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations. Instead of integrating complex logic inside the cost-sensitive memory, Pinatubo redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently, and support one-step multi-row operations. The experimental results on data intensive graph processing and database applications show that Pinatubo achieves a ~500 x speedup, ~28000x energy saving on bitwise operations, and 1.12× overall speedup, 1.11× overall energy saving over the conventional processor.

[1]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[2]  Manuela M. Veloso,et al.  Fast and inexpensive color image segmentation for interactive robots , 2000, Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000) (Cat. No.00CH37113).

[3]  G. Renault,et al.  from the STAR experiment , 2004 .

[4]  Kesheng Wu,et al.  FastBit: An Efficient Indexing Technology For Accelerating Data-Intensive Science , 2005 .

[5]  Jun Yang,et al.  A durable and energy efficient main memory using phase change memory technology , 2009, ISCA '09.

[6]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[7]  Yoshihiro Ueda,et al.  A 64Mb MRAM with clamped-reference and adequate-reference schemes , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[8]  Roberto Bez,et al.  A 90nm 4Mb embedded phase-change memory with 1.2V 12ns read access time and 1MB/s write throughput , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[9]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[10]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11]  Martín Pedemonte,et al.  Bitwise operations for GPU implementation of genetic algorithms , 2011, GECCO '11.

[12]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[13]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[14]  Cong Xu,et al.  NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[15]  David A. Patterson,et al.  Direction-optimizing Breadth-First Search , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[16]  Meng-Fan Chang,et al.  An Offset-Tolerant Fast-Random-Read Current-Sampling-Based Sense Amplifier for Small-Cell-Current Nonvolatile Memory , 2013, IEEE Journal of Solid-State Circuits.

[17]  Eby G. Friedman,et al.  AC-DIMM: associative computing with STT-MRAM , 2013, ISCA.

[18]  Jing Li,et al.  1 Mb 0.41 µm² 2T-2R Cell Nonvolatile TCAM With Two-Bit Encoding and Clocked Self-Referenced Sensing , 2014, IEEE Journal of Solid-State Circuits.

[19]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[20]  H. Li,et al.  A learnable parallel processing architecture towards unity of memory and computing , 2015, Scientific Reports.

[21]  Huawei Li,et al.  ProPRAM: Exploiting the transparent logic resources in Non-Volatile Memory for Near Data Computing , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[22]  Tejas Karkhanis,et al.  Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..

[23]  Cong Xu,et al.  Memory and Storage System Design with Nonvolatile Memory Technologies , 2015, IPSJ Trans. Syst. LSI Des. Methodol..

[24]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[25]  Henk Corporaal,et al.  Memristor based computation-in-memory architecture for data-intensive applications , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[26]  Onur Mutlu,et al.  Fast Bulk Bitwise AND and OR in DRAM , 2015, IEEE Computer Architecture Letters.