Accelerating Synchronization in Graph Analytics Using Moving Compute to Data Model on Tilera TILE-Gx72

The shared-memory cache-coherence paradigm is prevalent in modern multicores. However, as the number of cores increases, synchronization between threads limits performance scaling. The Tilera TILE-Gx72 multicore augments shared-memory cache coherence with hardware core-to-core explicit messaging as an auxiliary communication capability. We propose to use this auxiliary explicit-messaging capability to build a moving-compute-to-data model that accelerates synchronization through fine-grain serialization of critical code regions at dedicated cores. The proposed communication model exploits data locality and improves performance over both spin-lock-based and atomic-instruction-based synchronization for a set of parallelized graph-analytics benchmarks executing on real-world graphs. Experimental results show an average of 34% better performance over spin-locks and 15% over atomic instructions in a 64-core setup on the TILE-Gx72.
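To make the model concrete, below is a minimal C sketch of the idea, not the paper's implementation: a dedicated service core owns a slice of the graph and executes all critical sections on it sequentially, while worker cores ship small request messages instead of acquiring locks. On the TILE-Gx72 the requests would travel over the chip's user-level on-chip network; the C11-atomic single-producer/single-consumer ring buffer used here is a software stand-in for that hardware messaging path, and all names (send_req, service_core, node_val) are illustrative assumptions.

/* Sketch of moving compute to data: a dedicated "service" core
   serializes critical sections on the graph data it owns, so that data
   stays hot in its local cache. The SPSC ring buffer below stands in
   for the hardware core-to-core messaging fabric. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define QCAP   1024u                  /* ring capacity (power of two)      */
#define NNODES 8
#define NREQS  100000u

typedef struct { uint32_t node; uint32_t delta; } req_t;

static req_t ring[QCAP];
static atomic_uint head, tail;        /* consumed / produced counters      */
static uint64_t node_val[NNODES];     /* graph state owned by service core */
static atomic_int done;

/* Worker side: ship the update to the owner core instead of taking a lock. */
static void send_req(req_t r) {
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    while (t - atomic_load_explicit(&head, memory_order_acquire) == QCAP)
        ;                             /* ring full: wait                   */
    ring[t & (QCAP - 1)] = r;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
}

/* Dedicated core: receives requests and executes each critical section
   sequentially, so no lock or atomic read-modify-write is ever needed.   */
static void *service_core(void *arg) {
    (void)arg;
    unsigned h = 0;
    for (;;) {
        if (h == atomic_load_explicit(&tail, memory_order_acquire)) {
            if (atomic_load(&done) &&
                h == atomic_load_explicit(&tail, memory_order_acquire))
                return NULL;          /* drained and producer finished     */
            continue;
        }
        req_t r = ring[h & (QCAP - 1)];
        atomic_store_explicit(&head, ++h, memory_order_release);
        node_val[r.node] += r.delta;  /* the serialized critical section   */
    }
}

int main(void) {
    pthread_t svc;
    pthread_create(&svc, NULL, service_core, NULL);
    for (uint32_t i = 0; i < NREQS; i++)
        send_req((req_t){ .node = i % NNODES, .delta = 1 });
    atomic_store(&done, 1);
    pthread_join(svc, NULL);
    for (int n = 0; n < NNODES; n++)
        printf("node %d: %llu\n", n, (unsigned long long)node_val[n]);
    return 0;
}

With many worker cores, the single-producer queue would become a multi-producer one (or one ring per worker); that arbitration is precisely what the hardware messaging network provides for free, which is where the measured advantage over spin-locks and atomics comes from.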
