Parallel asynchronous matrix multiplications for a distributed pipelined neural network

Machine learning is an approach to devising algorithms that compute an output not from a given rule set but based on a self-learning concept. This approach is of great importance for several fields of application in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, previous experience is commonly used to train the network so that it produces good outputs for newly introduced inputs. By increasing the size of the network, more complex problems can be solved, which in turn requires a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements, and thus to the need for parallelization. Several parallelization approaches for neural networks have already been considered. Most approaches use special-purpose hardware, whilst other work focuses on standard hardware. Often these approaches target the problem by parallelizing over the training data.

In this work a new parallelization method named poadSGD is proposed for the parallelization of fully connected, large-scale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications, which here occur in a special distributed setup where each matrix is held by a different process group.

GASPI, a parallel programming model from the class of "Partitioned Global Address Space" (PGAS) models, is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication, it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based on GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied to the matrices. Based on this distribution, the multiplication algorithm iterates diagonally over the sub-blocks of the resulting matrix and computes the sub-blocks in subgroups of the processes. The sub-blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The pairwise communication is set up so that it can be overlapped by other ongoing computations. The implementation poses a particular challenge, since the asynchronous communication routines must be handled with care with respect to which process is working with which data at what point in time, in order to prevent an unintentional dual use of data.

The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub-blocks of the matrices exceeds 382. The performance achieved in the test runs did not meet the expectations predicted by the theoretical analysis. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072.
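For illustration, the following C sketch shows the standard index arithmetic behind such a two-dimensional block-cyclic layout: mapping a global block (bi, bj) to its owning process in a Pr x Pc grid and to local block coordinates on that process. The grid shape, block counts and all names are assumptions chosen for the example; this is a minimal sketch of the general scheme, not the thesis implementation.

```c
/* Minimal sketch of a 2D block-cyclic layout (illustrative only).
 * Block rows are dealt out cyclically over the grid rows, block
 * columns over the grid columns. */
#include <stdio.h>

/* Owner (pr, pc) of global block (bi, bj) in a Pr x Pc process grid. */
static void block_owner(int bi, int bj, int Pr, int Pc, int *pr, int *pc)
{
    *pr = bi % Pr;
    *pc = bj % Pc;
}

/* Local block coordinates of (bi, bj) on its owning process. */
static void local_block(int bi, int bj, int Pr, int Pc, int *li, int *lj)
{
    *li = bi / Pr;
    *lj = bj / Pc;
}

int main(void)
{
    const int Pr = 4, Pc = 4;   /* assumed 4 x 4 process grid            */
    const int nb = 8;           /* assumed 8 x 8 blocks in the matrix    */

    for (int bi = 0; bi < nb; ++bi) {
        for (int bj = 0; bj < nb; ++bj) {
            int pr, pc, li, lj;
            block_owner(bi, bj, Pr, Pc, &pr, &pc);
            local_block(bi, bj, Pr, Pc, &li, &lj);
            printf("block (%d,%d) -> process (%d,%d), local (%d,%d)\n",
                   bi, bj, pr, pc, li, lj);
        }
    }
    return 0;
}
```

With this mapping, every process holds roughly the same number of sub-blocks, which is what gives the workload balance referred to above.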
The implementation using the GASPI API was found not to be straightforward, but to provide good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within a future implementation of the poadSGD method. The poadSGD method seems very promising, especially as nowadays, with the larger amounts of data and the increased complexity of applications, approaches to the parallelization of neural networks are of increasing interest.
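To illustrate the one-sided, notification-based communication style on which this overlap potential rests, a minimal GASPI (GPI-2) sketch follows: one process pushes data into a remote segment with gaspi_write_notify and continues computing, while the target waits for the notification only when it is ready to use the data. Segment ids, sizes and the rank roles are illustrative assumptions and not the thesis code.

```c
/* Minimal sketch (assumptions, not the thesis code): a one-sided,
 * notified data push with the GASPI API (GPI-2).
 * Build against GPI-2 and link with -lGPI2. */
#include <GASPI.h>
#include <stdlib.h>

int main(void)
{
    gaspi_proc_init(GASPI_BLOCK);

    gaspi_rank_t rank, nprocs;
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&nprocs);

    /* One RDMA segment per process; size is an illustrative assumption. */
    const gaspi_segment_id_t seg = 0;
    const gaspi_size_t seg_size = 1 << 20;
    gaspi_segment_create(seg, seg_size, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    if (nprocs > 1 && rank == 0) {
        /* Push half the segment into rank 1 and attach notification 0.
         * The call only queues the transfer, so local computation on
         * other data can continue immediately afterwards. */
        gaspi_write_notify(seg, 0,          /* local segment, offset   */
                           1,               /* target rank             */
                           seg, 0,          /* remote segment, offset  */
                           seg_size / 2,    /* bytes to transfer       */
                           0, 1,            /* notification id, value  */
                           0, GASPI_BLOCK); /* queue, timeout          */
        gaspi_wait(0, GASPI_BLOCK);         /* local completion only   */
    } else if (nprocs > 1 && rank == 1) {
        /* The target checks for the notification only when it is ready
         * to consume the data; no blocking receive is involved. */
        gaspi_notification_id_t fid;
        gaspi_notification_t val;
        gaspi_notify_waitsome(seg, 0, 1, &fid, GASPI_BLOCK);
        gaspi_notify_reset(seg, fid, &val);
    }

    gaspi_proc_term(GASPI_BLOCK);
    return EXIT_SUCCESS;
}
```

The decisive point for the overlap, and for the data-dependency care discussed above, is that the sender must not reuse the local buffer before gaspi_wait returns and the receiver must not read the target region before the notification arrives.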
