Exploration of task-based scheduling for convolutional neural network accelerators under memory constraints

Development of application-specific accelerators for deep convolutional neural networks (ConvNets) has mainly focused on accelerating the computationally intensive layers, that is, the convolutional layers, to improve performance and energy efficiency. Traditional approaches in this space have relied on handcrafted dataflow implementations to exploit the fine-grained parallelism and data-locality properties within these layers. However, ConvNet layers also hold untapped potential in cross-layer data locality. In this work, we explore a novel approach in the context of deep neural network accelerators by modelling the computation as a task-dependency directed acyclic graph (DAG) and proposing a memory-aware heuristic, based on Heterogeneous Earliest Finish Time (HEFT), for task-graph scheduling on shared-memory systems. Our results show the benefits of task graphs in terms of lower memory use (23.4% less) over conventional layer-by-layer processing in a simulated environment with the first three layers of LeNet-5. Certain task graphs trade off makespan (a 10% increase) for memory use (a 20% decrease). Finally, our exploration of graphs with different slicing configurations for the pooling layer, using memory-aware HEFT versus the original HEFT, reveals that regularly shaped tiles across layers offer better makespan and memory use than tiles with a large dimension along one axis.
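To make the scheduling idea concrete, the sketch below implements a small memory-aware, HEFT-style list scheduler over a task DAG in Python. It is a minimal illustration under assumed simplifications, not the implementation evaluated in this work: processing elements are taken to be homogeneous and to share a single memory budget, each task's output buffer is assumed to stay live until all of its consumers have run, and memory awareness is reduced to preferring the highest-rank ready task whose output still fits the budget (falling back to plain HEFT ordering when nothing fits). All function names, parameters, and costs are illustrative.

```python
from collections import defaultdict


def upward_ranks(succs, cost):
    """Standard HEFT upward rank: a task's cost plus the longest downstream path."""
    rank = {}

    def walk(t):
        if t not in rank:
            rank[t] = cost[t] + max((walk(s) for s in succs.get(t, ())), default=0.0)
        return rank[t]

    for t in cost:
        walk(t)
    return rank


def memory_aware_heft(succs, cost, mem, num_pes, budget):
    """Greedy list scheduler: highest upward rank first, but among ready tasks
    prefer those whose output buffer still fits the shared-memory budget."""
    preds = defaultdict(list)
    for t, followers in succs.items():
        for s in followers:
            preds[s].append(t)

    rank = upward_ranks(succs, cost)
    pe_free = [0.0] * num_pes   # time at which each processing element goes idle
    finish = {}                 # task -> finish time
    live = 0.0                  # shared memory held by live output buffers
    done = set()
    schedule = []

    while len(done) < len(cost):
        ready = [t for t in cost
                 if t not in done and all(p in done for p in preds[t])]
        fitting = [t for t in ready if live + mem[t] <= budget]
        task = max(fitting or ready, key=lambda t: rank[t])

        earliest = max((finish[p] for p in preds[task]), default=0.0)
        pe = min(range(num_pes), key=lambda p: max(pe_free[p], earliest))
        start = max(pe_free[pe], earliest)
        finish[task] = start + cost[task]
        pe_free[pe] = finish[task]
        live += mem[task]
        done.add(task)
        schedule.append((task, pe, start, finish[task]))

        # Release a predecessor's buffer once all of its consumers have run.
        for p in preds[task]:
            if all(s in done for s in succs.get(p, ())):
                live -= mem[p]
    return schedule


if __name__ == "__main__":
    # Toy diamond DAG: a convolution tile A feeds two pooling tiles B and C,
    # which feed a gathering task D (costs and buffer sizes are made up).
    succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    cost = {"A": 4.0, "B": 2.0, "C": 2.0, "D": 1.0}
    mem = {"A": 8.0, "B": 4.0, "C": 4.0, "D": 2.0}
    for task, pe, start, end in memory_aware_heft(succs, cost, mem,
                                                  num_pes=2, budget=16.0):
        print(f"{task}: PE{pe} [{start:.1f}, {end:.1f}]")
```

The design choice illustrated here is the same trade-off the abstract describes: restricting the candidate set to memory-fitting tasks can delay a high-rank task and lengthen the makespan, in exchange for keeping the peak footprint of live inter-layer buffers under the budget.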
