Livia: Data-Centric Computing Throughout the Memory Hierarchy

In order to scale, future systems will need to dramatically reduce data movement. Data movement is expensive in current designs because (i) traditional memory hierarchies force computation to happen unnecessarily far away from data and (ii) processing-in-memory approaches fail to exploit locality. We propose Memory Services, a flexible programming model that enables data-centric computing throughout the memory hierarchy. In Memory Services, applications express functionality as graphs of simple tasks, each task indicating the data it operates on. We design and evaluate Livia, a new system architecture for Memory Services that dynamically schedules tasks and data at the location in the memory hierarchy that minimizes overall data movement. Livia adds less than 3% area overhead to a tiled multicore and accelerates challenging irregular workloads by 1.3× to 2.4× while reducing dynamic energy by 1.2× to 4.7×.
