ZNNi - Maximizing the Inference Throughput of 3D Convolutional Networks on Multi-Core CPUs and GPUs

Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation, and object detection and localization. Here we consider the problem of inference, the application of a previously trained ConvNet, with emphasis on 3D images. Our goal is to maximize throughput, defined as average number of output voxels computed per unit time. Other things being equal, processing a larger image tends to increase throughput, because fractionally less computation is wasted on the borders of the image. It follows that an apparently slower algorithm may end up having higher throughput if it can process a larger image within the constraint of the available RAM. We introduce novel CPU and GPU primitives for convolutional and pooling layers, which are designed to minimize memory overhead. The primitives include convolution based on highly efficient pruned FFTs. Our theoretical analyses and empirical tests reveal a number of interesting findings. For some ConvNet architectures, cuDNN is outperformed by our FFT-based GPU primitives, and these in turn can be outperformed by our CPU primitives. The CPU manages to achieve higher throughput because of its fast access to more RAM. A novel primitive in which the GPU accesses host RAM can significantly increase GPU throughput. Finally, a CPU-GPU algorithm achieves the greatest throughput of all, 10x or more than other publicly available implementations of sliding window 3D ConvNets. All of our code has been made available as open source project.

[1]  Joseph F. Murray,et al.  Supervised Learning of Image Restoration with Convolutional Networks , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[2]  Luca Maria Gambardella,et al.  Fast image scanning with deep max-pooling convolutional neural networks , 2013, 2013 IEEE International Conference on Image Processing.

[3]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[4]  Klaus Kofler,et al.  Performance and Scalability of GPU-Based Convolutional Neural Networks , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[5]  Luca Maria Gambardella,et al.  Flexible, High Performance Convolutional Neural Networks for Image Classification , 2011, IJCAI.

[6]  Nicolae Popovici,et al.  Putting intel® threading building blocks to work , 2008, IWMSE '08.

[7]  H. Sebastian Seung,et al.  ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines , 2015, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[8]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[9]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[10]  Jim Jeffers,et al.  High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches , 2015 .

[11]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[12]  Patrice Y. Simard,et al.  High Performance Convolutional Neural Networks for Document Processing , 2006 .

[13]  Yann LeCun,et al.  Fast Training of Convolutional Networks through FFTs , 2013, ICLR.

[14]  Fabian Tschopp,et al.  Efficient convolutional neural networks for pixelwise classification on heterogeneous hardware systems , 2015, 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI).

[15]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[16]  James Reinders,et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[17]  Jeff Johnson,et al.  Fast Convolutional Nets With fbfft: A GPU Performance Evaluation , 2014, ICLR.

[18]  Yann LeCun,et al.  Multi-Digit Recognition Using a Space Displacement Neural Network , 1991, NIPS.

[19]  Henry S. Warren,et al.  Hacker's Delight , 2002 .

[20]  Yann LeCun,et al.  Toward automatic phenotyping of developing embryos from videos , 2005, IEEE Transactions on Image Processing.

[21]  Jürgen Schmidhuber,et al.  A fast learning algorithm for image segmentation with max-pooling convolutional networks , 2013, 2013 IEEE International Conference on Image Processing.

[22]  Sven Behnke,et al.  Accelerating Large-Scale Convolutional Neural Networks with Parallel Graphics Multiprocessors , 2010, ICANN.

[23]  Nir Shavit,et al.  The big data challenges of connectomics , 2014, Nature Neuroscience.

[24]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.