Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer

For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods was tested on a Blue Gene/P (BG/P) supercomputer with 8–128 K cores, using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10¹⁰ connections (K is 1024, M is 1024², and k is 1000). The spike exchange methods tested are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method, either implemented via non-blocking MPI_Isend or exploiting the very low-overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was the one using MPI_Isend, due to the high overhead of initiating each spike communication. The two best performing methods had similarly low overhead for initiating spike communication: the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework (DCMF) DCMF_Multicast, and a two-phase Multisend in which a DCMF_Multicast first sends to a subset of phase-one destination cores, each of which then passes the spike on to its subset of phase-two destination cores. Departure from ideal scaling for the Multisend methods is almost entirely due to load imbalance caused by the large variation in the number of cells that fire on each processor during the interval between synchronizations. Spike exchange time itself is negligible, since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will ultimately be limited by the imbalance in the number of spikes arriving at each processor between synchronization intervals.
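The two-phase Multisend described above can be sketched in terms of how the destination list is split: instead of the source core initiating one send per target, it sends to roughly √T phase-one intermediaries, each of which forwards the spike to its share of the remaining targets. This is a minimal illustrative sketch of that partitioning logic only; the actual implementation in the paper uses DCMF_Multicast on BG/P hardware, and the function and parameter names here are hypothetical.

```python
import math

def two_phase_targets(targets, fanout=None):
    """Split a flat list of destination ranks into two phases.

    The source sends the spike to `fanout` phase-one ranks; each of
    those forwards it to its assigned subset of phase-two ranks.
    Returns (phase_one, forward_map). Hypothetical sketch, not the
    paper's DCMF-based implementation.
    """
    if fanout is None:
        # sqrt(T) intermediaries balances sends per core across phases
        fanout = max(1, math.isqrt(len(targets)))
    phase_one = targets[:fanout]
    forward_map = {r: [] for r in phase_one}
    # Round-robin the remaining destinations over the intermediaries
    for i, t in enumerate(targets[fanout:]):
        forward_map[phase_one[i % fanout]].append(t)
    return phase_one, forward_map

targets = list(range(100))
p1, fwd = two_phase_targets(targets)
# Every destination is reached exactly once across both phases.
delivered = sorted(p1 + [t for lst in fwd.values() for t in lst])
assert delivered == targets
# The source initiates ~sqrt(T) sends (10) instead of T (100).
assert len(p1) == 10
```

The payoff is that no single core initiates more than about √T sends per spike, which keeps the initiation overhead low and lets the DMA controller overlap the second phase with computation.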
Thus, counterintuitively, maximizing load balance requires that the distribution of cells across processors not reflect the neural network architecture: cells should instead be distributed randomly, so that sets of cells which burst-fire together reside on different processors, with their targets spread over as large a set of processors as possible.
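The load-balance argument above can be made concrete with a toy calculation, assuming (hypothetically) a single contiguous block of cells that bursts together in one synchronization interval: a layout that keeps neighboring cells on the same rank concentrates the burst on a few processors, while a scattered (round-robin, effectively random) layout spreads it evenly.

```python
def max_rank_load(cell_to_rank, firing_cells, nranks):
    """Largest number of firing cells hosted by any one rank
    in a synchronization interval (toy model, not simulator code)."""
    load = [0] * nranks
    for c in firing_cells:
        load[cell_to_rank[c]] += 1
    return max(load)

ncells, nranks = 8192, 64
# Hypothetical correlated burst: cells 0..1023 fire together.
burst = list(range(1024))
# Clustered layout (reflects net architecture): 128 consecutive cells per rank.
clustered = [c * nranks // ncells for c in range(ncells)]
# Scattered layout: consecutive cells on different ranks.
scattered = [c % nranks for c in range(ncells)]
print(max_rank_load(clustered, burst, nranks))  # 128 (burst hits only 8 ranks)
print(max_rank_load(scattered, burst, nranks))  # 16 (burst spread over all 64)
```

With the clustered layout the 8 ranks holding the burst must send 128 spikes each while the other 56 ranks sit idle; the scattered layout cuts the per-rank maximum by 8×, which is the effect driving the paper's conclusion.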
