Livia: Data-Centric Computing Throughout the Memory Hierarchy
暂无分享,去创建一个
Nathan Beckmann | Daniel Sanchez | Shashwat Gupta | Elliot Lockerman | Axel Feldmann | Mohammad Bakhshalipour | Alexandru Stanescu | Nathan Beckmann | Daniel Sánchez | Elliot Lockerman | Shashwat Gupta | Mohammad Bakhshalipour | Alexandru Stanescu | Axel Feldmann
[1] Rahul Boyapati,et al. Active-Routing: Compute on the Way for Near-Data Processing , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[2] Yuan Xie,et al. Die Stacking Is Happening , 2018, IEEE Micro.
[3] Srinivas Devadas,et al. IMP: Indirect memory prefetcher , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[4] Daniel Sánchez,et al. Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[5] Mike Ignatowski,et al. TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.
[6] John Shalf,et al. Exascale Computing Technology Challenges , 2010, VECPAR.
[7] D. Lenoski,et al. The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[8] Calvin Lin,et al. Linearizing irregular memory accesses for improved correlated prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Brian N. Bershad,et al. Lightweight remote procedure call , 1989, TOCS.
[10] J. Hennessy. A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[11] Feifei Li,et al. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[12] K. ReinhardtS.,et al. Tempest and typhoon , 1994 .
[13] Jaehyuk Huh,et al. Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[14] Shreesha Srinath,et al. An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15] Youngjin Kwon,et al. Coordinated and Efficient Huge Page Management with Ingens , 2016, OSDI.
[16] Benoît Dupont de Dinechin,et al. A clustered manycore processor architecture for embedded and accelerated applications , 2013, 2013 IEEE High Performance Extreme Computing Conference (HPEC).
[17] John Wawrzynek,et al. Garp: a MIPS processor with a reconfigurable coprocessor , 1997, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186).
[18] Jung Ho Ahn,et al. Accelerating linked-list traversal through near-data processing , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[19] Toshitsugu Yuba,et al. An Architecture Of A Dataflow Single Chip Processor , 1989, The 16th Annual International Symposium on Computer Architecture.
[20] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.
[21] Seth Copen Goldstein,et al. Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.
[22] John Kubiatowicz,et al. A Hardware Accelerator for Tracing Garbage Collection , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[23] Mark Horowitz,et al. 1.1 Computing's energy problem (and what we can do about it) , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).
[24] Daniel Sánchez,et al. Data-centric execution of speculative parallel programs , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] Vikas Agarwal,et al. Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[26] Scott Hauck,et al. The Chimaera reconfigurable functional unit , 1997, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[27] Sam Ainsworth,et al. An Event-Triggered Programmable Prefetcher for Irregular Workloads , 2018, ASPLOS.
[28] Xiaosong Ma,et al. Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[29] J. Rose,et al. The effect of LUT and cluster size on deep-submicron FPGA performance and density , 2000, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[30] Karthikeyan Sankaralingam,et al. Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.
[31] André Seznec,et al. A case for two-way skewed-associative caches , 1993, ISCA '93.
[32] Milo M. K. Martin,et al. Why on-chip cache coherence is here to stay , 2012, Commun. ACM.
[33] Daniel Sánchez,et al. Whirlpool: Improving Dynamic Cache Management with Static Data Classification , 2016, ASPLOS.
[34] Christoforos E. Kozyrakis,et al. The ZCache: Decoupling Ways and Associativity , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[35] Henry Hoffmann,et al. On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.
[36] Ziqi Wang,et al. Building a Bw-Tree Takes More Than Just Buzz Words , 2018, SIGMOD Conference.
[37] Jing Liu,et al. Image annotation via graph learning , 2009, Pattern Recognit..
[38] Henry Hoffmann,et al. Remote Store Programming , 2010, HiPEAC.
[39] Omer Khan,et al. EM2: A Scalable Shared-Memory Multicore Architecture , 2010 .
[40] Kenneth B. Kent,et al. The VTR project: architecture and CAD for FPGAs from verilog to routing , 2012, FPGA '12.
[41] André Seznec,et al. Decoupled sectored caches: conciliating low tag implementation cost , 1994, ISCA '94.
[42] Margaret Martonosi,et al. Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[43] Rajeev Motwani,et al. The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.
[44] Onur Mutlu,et al. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[45] Eric Rotenberg,et al. Jigsaw: Scalable software-defined caches , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[46] Daniel Sánchez,et al. Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[47] Karthikeyan Sankaralingam,et al. Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[48] P. Hanrahan,et al. Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[49] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[50] Mingyu Gao,et al. HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[51] Derek Chiou,et al. Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching , 2018, ASPLOS.
[52] Rachata Ausavarungnirun,et al. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.
[53] Karthikeyan Sankaralingam,et al. Stream-dataflow acceleration , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[54] Guy E. Blelloch,et al. Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.
[55] SeznecA.. Decoupled sectored caches , 1994 .
[56] David E. Culler,et al. Monsoon: an explicit token-store architecture , 1998, ISCA '98.
[57] Kunle Olukotun,et al. Plasticine: A reconfigurable architecture for parallel patterns , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[58] Yunming Zhang,et al. Optimizing indirect memory references with milk , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[59] Jonathan W. Berry,et al. Challenges in Parallel Graph Processing , 2007, Parallel Process. Lett..
[60] Josep Torrellas,et al. PageForge: A Near-Memory Content-Aware Page-Merging Architecture , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[61] David Wentzlaff,et al. Scaling Datacenter Accelerators with Compute-Reuse Architectures , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[62] Anoop Gupta,et al. The Stanford FLASH multiprocessor , 1994, ISCA '94.
[63] Daniel Sánchez,et al. Exploiting semantic commutativity in hardware speculation , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[64] Etienne Sicard,et al. Introducing 10-nm FinFET technology in Microwind , 2017 .
[65] Christos Kozyrakis,et al. Smart Memories Polymorphic Chip Multiprocessor , 2009 .
[66] Osman S. Unsal,et al. Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[67] Christos Faloutsos,et al. R-MAT: A Recursive Model for Graph Mining , 2004, SDM.
[68] Onur Mutlu,et al. Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).
[69] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.
[70] Eran Yahav,et al. Practical concurrent binary search trees via logical ordering , 2014, PPoPP '14.
[71] Christoforos E. Kozyrakis,et al. Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[72] Daniel Sánchez,et al. Jenga: Software-defined cache hierarchies , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[73] Onur Mutlu,et al. Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[74] Derek Chiou,et al. Worklist-Directed Prefetching , 2017, IEEE Computer Architecture Letters.
[75] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[76] Mahmut T. Kandemir,et al. Opportunistic Computing in GPU Architectures , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).
[77] Christoforos E. Kozyrakis,et al. Dynamic Fine-Grain Scheduling of Pipeline Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[78] James R. Larus,et al. Tempest and typhoon: user-level shared memory , 1994, ISCA '94.
[79] Emery D. Berger,et al. Grace: safe multithreaded programming for C/C++ , 2009, OOPSLA '09.
[80] William J. Dally,et al. SCNN: An accelerator for compressed-sparse convolutional neural networks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[81] R. E. Kessler,et al. Cray T3D: a new dimension for Cray Research , 1993, Digest of Papers. Compcon Spring.
[82] Daniel Sánchez,et al. Scaling distributed cache hierarchies through computation and data co-scheduling , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[83] Guowei Zhang,et al. Leveraging Hardware Caches for Memoization , 2018, IEEE Computer Architecture Letters.
[84] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.
[85] Xin-She Yang,et al. Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.
[86] David Blaauw,et al. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[87] David A. Patterson,et al. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server , 2015, 2015 IEEE International Symposium on Workload Characterization.
[88] Douglas J. Joseph,et al. Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[89] Nathan Beckmann,et al. PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates , 2019, MICRO.
[90] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[91] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[92] Aamer Jaleel,et al. High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.
[93] Farheen Fatima Khan,et al. A study on the accuracy of minimum width transistor area in estimating FPGA layout area , 2017, Microprocess. Microsystems.
[94] David Blaauw,et al. Cache Automaton , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[95] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).