Towards Near-Data Processing of Compare Operations in 3D-Stacked Memory

The gap between processing speed and memory access speed in modern multi-core systems has become a bottleneck for emerging data-intensive workloads. In this setting, it is attractive to move some computation closer to the data, which motivates near-data processing (NDP). Compare (scan) operations, which are central to many applications and to databases in particular, can benefit from NDP. We propose the near-data compare unit (NDCU), a minimally invasive hardware unit that can be integrated into the existing ecosystem of the hybrid memory cube (HMC). Around the NDCU we have designed two full-system architectures: a lighter NDP design with no parallelism (NNP), and an NDP design with vault-level parallelism (NVLP). The first architecture is more power- and area-efficient, while the second is much faster and incurs negligible overheads. To carry out scan operations, we have implemented the 'compare-n-hit', 'compare-n-count' and 'compare-n-max' operations on both row-store and column-store databases, and we find significant improvements over a conventional CPU-based system. We obtain performance improvements of around 2.3x with NNP and 37x with NVLP. In both designs, energy consumption is reduced by around 8x on average.
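The abstract names the three scan operations without defining them. As a rough software reference only, the sketch below shows one plausible reading of each on a single integer attribute. The function names, the equality predicate, the return types, and the sample data are all assumptions made for illustration; they are not taken from the paper and do not model the NDCU hardware or the HMC.

```python
from typing import List, Sequence, Tuple


def compare_n_hit(values: Sequence[int], key: int) -> List[int]:
    """Assumed semantics: return the positions whose value equals key."""
    return [i for i, v in enumerate(values) if v == key]


def compare_n_count(values: Sequence[int], key: int) -> int:
    """Assumed semantics: return how many values equal key."""
    return sum(1 for v in values if v == key)


def compare_n_max(values: Sequence[int]) -> int:
    """Assumed semantics: return the largest value via repeated compares."""
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best


def column_of(rows: Sequence[Tuple[int, ...]], field: int) -> List[int]:
    """Row-store access: pick one field out of every record (strided read)."""
    return [row[field] for row in rows]


# Hypothetical table with fields (id, age), shown in both layouts.
rows = [(1, 30), (2, 45), (3, 30), (4, 12)]   # row-store: records are contiguous
ages_column = [30, 45, 30, 12]                # column-store: one attribute is contiguous

hits = compare_n_hit(ages_column, 30)
count = compare_n_count(column_of(rows, 1), 30)
oldest = compare_n_max(ages_column)
```

In this reading, the column-store scan touches one contiguous array, while the row-store scan has to step over the unused fields of every record; both layouts feed the same compare logic.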
