Dissecting the Graphcore IPU Architecture via Microbenchmarking

This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform recently introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML) workloads. We dissect the IPU's performance behavior using microbenchmarks that we crafted for the purpose. We study the IPU's memory organization and performance. We study the latency and bandwidth that the on-chip and off-chip interconnects offer, both in point-to-point transfers and in a spectrum of collective operations, under diverse loads. We evaluate the IPU's compute power over matrix multiplication, convolution, and AI/ML primitives. We discuss actual performance in comparison with its theoretical limits. Our findings reveal how the IPU's architectural design affects its performance. Moreover, they offer simple mental models to predict an application's performance on the IPU, on the basis of the computation and communication steps it involves. This report is the natural extension to a novel architecture of a continuing effort of ours that focuses on the microbenchmark-based discovery of massively parallel architectures.

[1]  Marco Maggioni,et al.  Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking , 2018, ArXiv.

[2]  Pierre L'Ecuyer,et al.  TestU01: A C library for empirical testing of random number generators , 2006, TOMS.

[3]  Sebastiano Vigna,et al.  Scrambled Linear Pseudorandom Number Generators , 2018, ACM Trans. Math. Softw..

[4]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[5]  Chris J. Scheiman,et al.  LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Fabrizio Petrini,et al.  Cell Multiprocessor Communication Network: Built for Speed , 2006, IEEE Micro.

[8]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[10]  Marco Maggioni,et al.  Dissecting the NVidia Turing T4 GPU via Microbenchmarking , 2019, ArXiv.

[11]  Mark A. Moraes,et al.  Parallel random numbers: As easy as 1, 2, 3 , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[12]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[13]  Dhabaleswar K. Panda,et al.  Microbenchmark performance comparison of high-speed cluster interconnects , 2004, IEEE Micro.

[14]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[15]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).