PIMS: a lightweight processing-in-memory accelerator for stencil computations
John D. Leidel | Jie Li | Antonino Tumeo | Brody Williams | Xi Wang | Yong Chen
[1] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[2] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[3] Luca Benini,et al. High performance AXI-4.0 based interconnect for extensible smart memory cubes , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[4] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[5] Antonino Tumeo,et al. MAC: Memory Access Coalescer for 3D-Stacked Memory , 2019, ICPP.
[6] Onur Mutlu,et al. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[7] Gabriel H. Loh,et al. A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM , 2013 .
[8] Christoph Hagleitner,et al. An architecture for near-data processing systems , 2016, Conf. Computing Frontiers.
[9] Richard Veras,et al. A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.
[10] Michael A. Frumkin,et al. Tight bounds on cache use for stencil operations on rectangular grids , 2002, JACM.
[11] P.M. Kogge,et al. Pursuing a petaflop: point designs for 100 TF computers using PIM technologies , 1996, Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96).
[12] P. Sadayappan,et al. High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.
[13] Gerhard Wellein,et al. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model , 2014, ICS.
[14] Yong Chen,et al. HMC-Sim: A Simulation Framework for Hybrid Memory Cube Devices , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.
[15] Weiqiang Wang,et al. In-Core Optimization of High-Order Stencil Computations , 2009, PDPTA.
[16] Gerhard Wellein,et al. Exploring performance and power properties of modern multi‐core chips via simple machine models , 2012, Concurr. Comput. Pract. Exp..
[17] Kiyoung Choi,et al. A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[18] Dong Li,et al. Performance Implications of Processing-in-Memory Designs on Data-Intensive Applications , 2016, 2016 45th International Conference on Parallel Processing Workshops (ICPPW).
[19] Hans-Peter Seidel,et al. Cache Accurate Time Skewing in Iterative Stencil Computations , 2011, 2011 International Conference on Parallel Processing.
[20] Jung Ho Ahn,et al. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[21] Tejas Karkhanis,et al. Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev..
[22] Alexandros Labrinidis,et al. Challenges and Opportunities with Big Data , 2012, Proc. VLDB Endow..
[23] Dietmar Fey,et al. High Performance Stencil Code Algorithms for GPGPUs , 2011, ICCS.
[24] Samuel Williams,et al. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..
[25] J. Jeddeloh,et al. Hybrid memory cube new DRAM architecture increases density and performance , 2012, 2012 Symposium on VLSI Technology (VLSIT).
[26] Ramyad Hadidi,et al. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[27] Jose Renau,et al. Programming the FlexRAM parallel intelligent memory system , 2003, PPoPP '03.
[28] Christoforos E. Kozyrakis,et al. Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[29] Robert Strzodka,et al. Impact of System and Cache Bandwidth on Stencil Computations Across Multiple Processor Generations , 2011 .
[30] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[31] Mike Ignatowski,et al. TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.
[32] Pradeep Dubey,et al. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[33] Eduard Ayguadé,et al. Another Trip to the Wall: How Much Will Stacked DRAM Benefit HPC? , 2015, MEMSYS.
[34] Sudhakar Yalamanchili,et al. Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).
[35] Paul Rosenfeld,et al. Performance Exploration of the Hybrid Memory Cube , 2014 .
[36] Paulius Micikevicius,et al. 3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.
[37] Volker Strumpen,et al. Cache oblivious stencil computations , 2005, ICS '05.
[38] M. Oskin,et al. Active Pages: a computation model for intelligent memory , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).
[39] Holger Fröning,et al. Exploring Time and Energy for Complex Accesses to a Hybrid Memory Cube , 2016, MEMSYS.