Paper Information - CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

Algorithms that operate on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack graph applications, which must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multithreaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
