Scaling Up a Multispectral ResNet-50 to 128 GPUs

As in other scientific domains, Deep Learning (DL) holds great promise for meeting the challenging needs of Remote Sensing (RS) applications. However, the growing volume, variety, and complexity of the acquisitions carried out daily by Earth Observation (EO) missions create new processing and storage challenges within operational processing pipelines. The aim of this work is to show that High-Performance Computing (HPC) systems can substantially reduce the training time of Convolutional Neural Networks (CNNs). Particular attention is paid to monitoring classification accuracy, which typically degrades when large batch sizes are used. The experimental results of this work show that training scales up to a batch size of 8,000 while achieving classification accuracy in line with that obtained with smaller batch sizes.
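A common way to realize this kind of data-parallel scaling is Horovod combined with the linear learning-rate scaling rule and a warm-up phase. The sketch below is a minimal, hypothetical example, not the code used in this work: it assumes TensorFlow with the stock Keras ResNet-50, and the per-GPU batch size, input dimensions (120x120 pixels, 10 spectral bands), and class count are placeholders.

```python
# Minimal sketch (assumed, not the authors' code) of data-parallel ResNet-50
# training with Horovod and TensorFlow/Keras, using the linear learning-rate
# scaling rule and a warm-up to counteract large-batch accuracy degradation.
# Input shape (120x120 pixels, 10 bands) and class count are placeholders.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one training process per GPU

# Pin each process to its local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

per_gpu_batch = 64                         # assumed per-GPU batch size
global_batch = per_gpu_batch * hvd.size()  # e.g. 8,192 on 128 GPUs

# Linear scaling rule: grow the learning rate with the number of workers
# so the effective step size matches the larger global batch.
base_lr = 0.1
opt = tf.keras.optimizers.SGD(learning_rate=base_lr * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)        # allreduce of gradients across GPUs

model = tf.keras.applications.ResNet50(
    weights=None, input_shape=(120, 120, 10), classes=19)
model.compile(optimizer=opt,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Synchronize initial weights across all workers.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Warm up the learning rate over the first epochs to stabilize
    # optimization at large batch sizes.
    hvd.callbacks.LearningRateWarmupCallback(
        initial_lr=base_lr * hvd.size(), warmup_epochs=5),
]

# Synthetic stand-in for real multispectral patches, just to make the
# sketch self-contained and runnable.
x = tf.random.normal((4 * per_gpu_batch, 120, 120, 10))
y = tf.one_hot(tf.random.uniform((4 * per_gpu_batch,),
                                 maxval=19, dtype=tf.int32), depth=19)
model.fit(x, y, batch_size=per_gpu_batch, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched under a hypothetical job script with, e.g., `horovodrun -np 128 python train.py`, each rank trains on its own data shard, and the gradient allreduce makes the 128 processes behave like a single optimizer step over the global batch.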
