Exploring Core and Cache Hierarchy Bottlenecks in Graph Processing Workloads

Graph processing is an important analysis technique for a wide range of big data problems. The ability to explicitly represent relationships between entities gives graph analytics significant performance advantage over traditional relational databases. In this paper, we perform an in-depth data-aware characterization of graph processing workloads on a simulated multi-core architecture, find bottlenecks in the core and the cache hierarchy that are not highlighted by previous characterization work, and analyze the behavior of the specific application data type causing the corresponding bottleneck. We find that load-load dependency chains involving different application data types form the primary bottleneck in achieving a high memory-level parallelism in graph processing workloads. We also observe that the private L2 cache has a negligible contribution to performance, whereas the shared L3 cache has higher performance sensitivity. In addition, we present a study on the effectiveness of several replacement policies. Finally, we study the relationship between different graph algorithms and the access volumes to the different data types. Overall, we provide useful insights and guidelines toward developing a more optimized CPU-based architecture for high performance graph processing.

[1]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[3]  Feifei Li,et al.  Graph Analytics Through Fine-Grained Parallelism , 2016, SIGMOD Conference.

[4]  David A. Patterson,et al.  The GAP Benchmark Suite , 2015, ArXiv.

[5]  Sachin Katti,et al.  Parallel Graph Processing on Modern Multi-core Servers: New Findings and Remaining Challenges , 2016, 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS).

[6]  Jignesh M. Patel,et al.  Data prefetching by dependence graph precomputation , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[7]  Onur Mutlu,et al.  Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, ISCA.

[8]  Pararth Shah,et al.  Ringo: Interactive Graph Analytics on Big-Memory Machines , 2015, SIGMOD Conference.

[9]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[10]  David A. Patterson,et al.  Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server , 2015, 2015 IEEE International Symposium on Workload Characterization.

[11]  Pradeep Dubey,et al.  Navigating the maze of graph analytics frameworks using massive graph datasets , 2014, SIGMOD Conference.

[12]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .