The Potential of the Intel® Xeon Phi for Supervised Deep Learning

Supervised learning of Convolutional Neural Networks (CNNs), also known as supervised Deep Learning, is a computationally demanding process. Finding the most suitable parameters of a network for a given application requires numerous training sessions, so reducing the training time per session is essential to fully utilize CNNs in practice. While many research groups have addressed the training of CNNs using GPUs, so far little attention has been paid to the Intel Xeon Phi coprocessor. In this paper we investigate empirically and theoretically the potential of the Intel Xeon Phi for supervised learning of CNNs. We design and implement a parallelization scheme named CHAOS that exploits both the thread- and SIMD-parallelism of the coprocessor. Our approach is evaluated on the Intel Xeon Phi 7120P using the MNIST dataset of handwritten digits for various thread counts and CNN architectures. Results show a 103.5x speedup when training our large network for 15 epochs using 244 threads, compared to one thread on the coprocessor. Moreover, we develop a performance model and use it to assess our implementation and answer what-if questions.
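To make the two levels of parallelism concrete, the C++/OpenMP sketch below illustrates how thread- and SIMD-parallelism are typically combined on the Xeon Phi: each thread processes different training examples while the innermost loops are vectorized. This is a minimal illustration under stated assumptions, not the authors' CHAOS implementation; all names, dimensions, and the lock-free sharing of weights are hypothetical.

// Illustrative sketch (not the paper's code): thread-level parallelism
// over training images combined with SIMD-vectorized inner loops, as
// one would express it in C++/OpenMP on the Xeon Phi.
#include <vector>

// Hypothetical layer dimensions, for illustration only.
constexpr int kImages = 1024;  // training examples processed per pass
constexpr int kOut    = 128;   // output neurons
constexpr int kIn     = 1152;  // inputs per output neuron

void forward_layer(const std::vector<float>& input,   // kImages x kIn
                   const std::vector<float>& weights, // kOut x kIn, shared by all threads
                   std::vector<float>& output)        // kImages x kOut
{
    // Thread parallelism: each hardware thread takes different images.
    #pragma omp parallel for schedule(dynamic)
    for (int img = 0; img < kImages; ++img) {
        for (int o = 0; o < kOut; ++o) {
            float acc = 0.0f;
            // SIMD parallelism: vectorize the inner dot product.
            #pragma omp simd reduction(+ : acc)
            for (int i = 0; i < kIn; ++i)
                acc += input[img * kIn + i] * weights[o * kIn + i];
            output[img * kOut + o] = acc;
        }
    }
}

int main() {
    std::vector<float> in(kImages * kIn, 1.0f);
    std::vector<float> w(kOut * kIn, 0.01f);
    std::vector<float> out(kImages * kOut, 0.0f);
    forward_layer(in, w, out);
    return 0;
}

For reference, the 103.5x figure is the standard speedup S(p) = T_1 / T_p with p = 244 threads: the parallel training run completes in roughly 1/103.5 of the single-thread training time on the same coprocessor.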
