Active-Routing: Compute on the Way for Near-Data Processing

With the explosion of data volumes and the demand for fast analytics, data-intensive applications, characterized by large memory footprints and low data reuse, place significant stress on modern memory systems and communication infrastructure. These workloads, ranging from data analytics to machine learning, perform a considerable number of aggregation operations over large amounts of data, and their performance is limited by the widening gap between high CPU compute density and insufficient memory bandwidth. In this paper, we propose Active-Routing, an in-network compute architecture that enables computing on the way for Near-Data Processing (NDP), reducing data movement across the memory hierarchy through a three-phase processing schedule. First, it moves computation to where the data resides and constructs an Active-Routing tree. Concurrently, the data processing phase begins as computations are offloaded to the memory-network switches. Finally, the compute-enabled network performs the aggregation operations on the way along the dynamically built routing tree, reducing traffic and parallelizing computation within the memory network. Our evaluations show that Active-Routing achieves up to 6x speedup, with an average performance improvement of 75% across various benchmarks, compared to a baseline system integrated with a die-stacked memory network.
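
To make the three-phase schedule concrete, below is a minimal software sketch of the idea, assuming a sum reduction over operands spread across memory cubes. The Switch class and the offload/gather names are hypothetical illustrations for this sketch, not the paper's actual hardware interface. The key property exploited is that the aggregation is commutative and associative, so partial results can be combined in any order at any switch along the tree.

```python
# A minimal software analogy of Active-Routing's three-phase schedule,
# assuming a sum reduction over data spread across memory cubes.
# All names here (Switch, offload, gather) are illustrative only.

class Switch:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # next hop toward the tree root
        self.children = []
        if parent:
            parent.children.append(self)
        self.partial = 0          # partial aggregate held at this switch

    def offload(self, operands):
        """Phase 2: compute near the data; keep only a partial result."""
        self.partial += sum(operands)

    def gather(self):
        """Phase 3: reduce children's partials on the way to the root."""
        return self.partial + sum(c.gather() for c in self.children)

# Phase 1: build an Active-Routing tree over the memory network,
# rooted at the switch nearest the requesting core.
root = Switch("root")
left, right = Switch("sw0", root), Switch("sw1", root)
leaves = [Switch(f"cube{i}", left if i < 2 else right) for i in range(4)]

# Phase 2: each leaf switch reduces the operands resident in its cube,
# so only one partial value per cube (not every operand) enters the network.
data = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
for i, leaf in enumerate(leaves):
    leaf.offload(data[i])

# Phase 3: partials are aggregated hop by hop along the tree toward the root.
assert root.gather() == sum(sum(v) for v in data.values())  # 36
```

Note how traffic shrinks at every level: each subtree forwards a single reduced value instead of all of its operands, which is the "compute on the way" effect the architecture targets.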
