Towards dataflow based graph processing

Although existing research has greatly improved the performance of the memory subsystem [4], it remains constrained by the underlying modern processor. We divide the execution of graph applications into three categories of slots, where slots represent the hardware resources needed to process micro-ops (uops), and present the result in Figure 1(a). Previous work has significantly improved the performance of graph applications by optimizing the 41% of stalls that result from the memory subsystem [5,6]. However, a large body of inefficiencies inside the underlying processor (35% of slots wasted) remains unknown and is seldom studied in existing work.

[1] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[2] Zhongyuan Zhang, et al. Community structure detection in social networks based on dictionary learning, 2011, Science China Information Sciences.

[3] David A. Patterson, et al. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server, 2015, 2015 IEEE International Symposium on Workload Characterization.

[4] Aart J. C. Bik, et al. Pregel: a system for large-scale graph processing, 2010, SIGMOD Conference.

[5] Joseph Gonzalez, et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, 2012, OSDI.

[6] Ozcan Ozturk, et al. Energy Efficient Architecture for Graph Analytics Accelerators, 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[7] Margaret Martonosi, et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).