Paper Information - Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads

Graph processing is an important analysis technique for a wide range of big data applications. The ability to explicitly represent relationships between entities gives graph analytics a significant performance advantage over traditional relational databases. However, at the microarchitecture level, the performance of single-machine in-memory graph analytics is bounded by inefficiencies in the memory subsystem. This paper makes two contributions that analyze and optimize the memory hierarchy for graph processing workloads. First, we perform an in-depth, data-type-aware characterization of graph processing workloads on a simulated multi-core architecture. We analyze 1) the memory-level parallelism in an out-of-order core and 2) the request reuse distance in the cache hierarchy. We find that load-load dependency chains involving different application data types form the primary bottleneck in achieving high memory-level parallelism. We also observe that different graph data types exhibit heterogeneous reuse distances. As a result, the private L2 cache contributes negligibly to performance, whereas the shared L3 cache shows higher performance sensitivity. Second, based on our profiling observations, we propose DROPLET, a Data-awaRe decOuPLed prEfeTcher for graph applications. DROPLET prefetches different graph data types differently according to their inherent reuse distances. In addition, DROPLET is physically decoupled to overcome the serialization caused by the dependency chains between different data types. DROPLET achieves 19%-102% performance improvement over a no-prefetch baseline, 9%-74% over a conventional stream prefetcher, 14%-74% over a Variable Length Delta Prefetcher, and 19%-115% over a delta correlation prefetcher implemented as a global history buffer. DROPLET performs 4%-12.5% better than a monolithic L1 prefetcher similar to the state-of-the-art prefetcher for graphs.
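To make the two profiled quantities concrete, the following is a minimal sketch rather than the paper's methodology. It assumes a graph stored in compressed sparse row (CSR) form and uses hypothetical names (`csr_trace`, `reuse_distances`, and the data-type labels `edge_list` and `vertex_property`) that do not come from the source. The traversal shows a load-load dependency: the address of each property load is known only after the neighbor ID has been loaded. The second function measures reuse distance as the number of distinct addresses touched between two consecutive accesses to the same address, grouped by data type.

```python
from collections import defaultdict


def csr_trace(offsets, neighbors):
    """Record the memory accesses of a simple traversal over a CSR graph.

    Each access is a (data_type, address) pair. The data-type labels are
    illustrative assumptions, not terminology taken from the paper.
    """
    trace = []
    for v in range(len(offsets) - 1):
        for e in range(offsets[v], offsets[v + 1]):
            # First load: the neighbor ID, read sequentially from the edge list.
            trace.append(("edge_list", ("nbr", e)))
            u = neighbors[e]
            # Dependent load: its address needs u, so it cannot be issued
            # until the load above has returned (a load-load dependency).
            trace.append(("vertex_property", ("prop", u)))
    return trace


def reuse_distances(trace):
    """Return {data_type: [reuse distance, ...]} for every repeated address.

    Reuse distance here is the count of distinct addresses accessed between
    two consecutive accesses to the same address.
    """
    last_pos = {}   # address -> index of its most recent access
    history = []    # addresses in access order
    result = defaultdict(list)
    for i, (dtype, addr) in enumerate(trace):
        if addr in last_pos:
            between = history[last_pos[addr] + 1:i]
            result[dtype].append(len(set(between)))
        last_pos[addr] = i
        history.append(addr)
    return dict(result)


if __name__ == "__main__":
    # Tiny 3-vertex example graph (made up for illustration).
    offsets = [0, 2, 3, 5]
    neighbors = [1, 2, 2, 0, 1]
    print(reuse_distances(csr_trace(offsets, neighbors)))
```

On this toy input the edge-list entries are each touched once and therefore never reused, while vertex-property entries are reused at varying distances. This is only a small-scale analogue of the heterogeneous reuse behavior the abstract describes; the quadratic-time distance computation is chosen for readability, not for profiling real traces.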
