Active-Routing: Compute on the Way for Near-Data Processing
暂无分享,去创建一个
Rahul Boyapati | Ki Hwan Yum | Jiayi Huang | Eun Jung Kim | Pritam Majumder | Sungkeun Kim | Ramprakash Reddy Puli | Eun Jung Kim | K. H. Yum | Pritam Majumder | Jiayi Huang | Sungkeun Kim | R. Boyapati
[1] Omer Khan,et al. CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores , 2015, 2015 IEEE International Symposium on Workload Characterization.
[2] Jaejin Lee,et al. 25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).
[3] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[4] Jung Ho Ahn,et al. Accelerating linked-list traversal through near-data processing , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[5] Yang Yang,et al. Deep Learning Scaling is Predictable, Empirically , 2017, ArXiv.
[6] Matthew Poremba,et al. There and back again: Optimizing the interconnect in networks of memory cubes , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[7] Jung Ho Ahn,et al. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[8] Babak Falsafi,et al. The mondrian data engine , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[9] Yuangang Wang,et al. A unified memory network architecture for in-memory computing in commodity servers , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[10] Ramyad Hadidi,et al. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[11] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[12] Feifei Li,et al. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[13] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[14] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[15] Onur Mutlu,et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[16] J. Thomas Pawlowski,et al. Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).
[17] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[18] Ralph Grishman,et al. The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer , 1983, IEEE Transactions on Computers.
[19] Rachata Ausavarungnirun,et al. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.
[20] William J. Dally,et al. Route packets, not wires: on-chip inteconnection networks , 2001, DAC '01.
[21] Hyoukjun Kwon,et al. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects , 2018, ASPLOS.
[22] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[23] Philip Heidelberger,et al. The IBM Blue Gene/Q interconnection network and message unit , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[24] Onur Mutlu,et al. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[25] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.
[26] Kiyoung Choi,et al. A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[27] Yuqing Zhu,et al. BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[28] Scott A. Mahlke,et al. In-Memory Data Parallel Processor , 2018, ASPLOS.
[29] Kiyoung Choi,et al. AIM , 2016 .
[30] Yoonho Park,et al. Data access optimization in a processing-in-memory system , 2015, Conf. Computing Frontiers.
[31] O Seongil,et al. McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[32] Jure Leskovec,et al. {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .
[33] Mehrzad Samadi,et al. Memory-centric system interconnect design with hybrid memory cubes , 2013, PACT 2013.
[34] Natalie D. Enright Jerger,et al. Supporting efficient collective communication in NoCs , 2012, IEEE International Symposium on High-Performance Comp Architecture.
[35] Seth Copen Goldstein,et al. Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.
[36] David Blaauw,et al. Compute Caches , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[37] Gabriel H. Loh,et al. 3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.
[38] Gregory F. Pfister,et al. “Hot spot” contention and combining in multistage interconnection networks , 1985, IEEE Transactions on Computers.
[39] Avinash Sodani,et al. Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).
[40] Ki-Seok Chung,et al. CasHMC: A Cycle-Accurate Simulator for Hybrid Memory Cube , 2017, IEEE Computer Architecture Letters.
[41] Tejas Karkhanis,et al. Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..
[42] Jacob Nelson,et al. IncBricks: Toward In-Network Computation with an In-Network Cache , 2017, ASPLOS.
[43] Daniel Sánchez,et al. Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[44] J. Jeddeloh,et al. Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).
[45] Aamer Jaleel,et al. Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..
[46] Jung Ho Ahn,et al. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[47] Kevin Skadron,et al. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).
[48] Dhabaleswar K. Panda,et al. Global reduction in wormhole k-ary n-cube networks with multidestination exchange worms , 1995, Proceedings of 9th International Parallel Processing Symposium.
[49] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[50] Jaeha Kim,et al. Memory-centric system interconnect design with Hybrid Memory Cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[51] Hairong Kuang,et al. The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).