Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms

Recent advances in neuroscientific understanding have made parallel computing devices modeled after the human neocortex a plausible prospect, one that promises fault tolerance and energy efficiency. These attributes have renewed interest in learning algorithms that aspire to reverse-engineer many of the brain's abilities. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Like the brain, our cortical network exhibits massive processing parallelism, which makes today's GPGPUs readily available and highly attractive hardware accelerators for such a model. We then address two inefficiencies inherent in our initial design: the overhead of repeated kernel launches and poor utilization of GPGPU resources. To mitigate these problems we propose two optimizations: a software work-queue structure and pipelining of the cortical network's hierarchical layers. Our analysis provides insight into GPU architectural details, including the number of cores, the memory system, and the global thread scheduler. We also build a runtime profiling tool for the parallel learning algorithm that proportionally distributes the cortical network across the host CPU and all available GPUs, whether homogeneous or heterogeneous. Using this profiling tool together with the optimizations on Nvidia's CUDA framework, we achieve up to a 60x speedup over a single-threaded CPU implementation of the model.
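To make the work-queue optimization concrete, the following is a minimal, hypothetical CUDA sketch of a persistent-kernel software work queue: thread blocks atomically claim work items from a global queue inside one long-lived kernel, so the host avoids paying launch overhead for every layer. The WorkItem layout, the g_head counter, and the placeholder per-column update are illustrative assumptions, not code from the paper.

#include <cstdio>
#include <cuda_runtime.h>

struct WorkItem {
    int layer;   // hierarchical layer in the cortical network
    int column;  // cortical-column index within that layer
};

__device__ int g_head = 0;  // next unclaimed slot in the queue

// A single long-lived kernel: thread blocks repeatedly pop items from a
// global queue instead of the host launching one kernel per layer, which
// amortizes kernel-launch overhead across the whole network.
__global__ void worker(const WorkItem* queue, int n_items, float* out) {
    __shared__ int idx;
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(&g_head, 1);  // claim the next work item
        __syncthreads();                  // all threads see the claimed idx
        if (idx >= n_items) return;       // queue drained: whole block exits
        WorkItem w = queue[idx];
        if (threadIdx.x == 0)             // placeholder for the per-column
            out[idx] = (float)(w.layer + w.column);  // learning update
        __syncthreads();                  // don't overwrite idx while in use
    }
}

int main() {
    const int n = 1024;
    WorkItem* queue;
    float* out;
    cudaMallocManaged(&queue, n * sizeof(WorkItem));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        queue[i] = {i / 256, i % 256};    // 4 layers x 256 columns
    worker<<<8, 128>>>(queue, n, out);    // 8 persistent blocks, one launch
    cudaDeviceSynchronize();
    printf("out[0] = %.1f\n", out[0]);
    cudaFree(queue);
    cudaFree(out);
    return 0;
}

Similarly, the proportional distribution performed by the profiling tool can be sketched as host-side code that splits cortical columns among devices according to measured throughput. The partition_columns helper and the example throughput values below are assumptions for illustration, not the paper's implementation.

#include <cstdio>
#include <vector>

// Split n_columns among devices in proportion to each device's measured
// throughput (e.g., columns per second from a short calibration pass).
std::vector<int> partition_columns(int n_columns,
                                   const std::vector<double>& throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;
    std::vector<int> share(throughput.size());
    int assigned = 0;
    for (size_t i = 0; i < throughput.size(); ++i) {
        share[i] = (int)(n_columns * throughput[i] / total);
        assigned += share[i];
    }
    share.back() += n_columns - assigned;  // give rounding remainder to last
    return share;
}

int main() {
    // Example: one CPU plus two unlike GPUs; throughputs come from profiling.
    std::vector<double> tput = {1.0, 6.5, 3.2};
    std::vector<int> share = partition_columns(10000, tput);
    for (size_t i = 0; i < share.size(); ++i)
        printf("device %zu: %d columns\n", i, share[i]);
    return 0;
}

In a heterogeneous system, the throughput of each device would be measured with a short calibration run of the cortical network, after which each device receives its proportional share of columns to process.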
