On Realizing Distributed Deep Neural Networks: An Astrophysics Case Study

Deep Learning architectures are widely adopted as the core machine learning framework in both industry and academia. With large amounts of data at their disposal, these architectures can autonomously extract highly descriptive features from virtually any type of input signal. However, the sheer volume of data, combined with the demand for substantial computational resources, introduces new challenges for computing platforms. The work presented herein explores the performance of Deep Learning in the field of astrophysics when conducted in a distributed environment. To set up such an environment, we capitalize on TensorFlowOnSpark, which combines TensorFlow's dataflow graphs with Spark's cluster management. We report on the performance of a CPU cluster, considering both the number of training nodes and the data distribution, and quantify their effects via training accuracy and training loss. Our results indicate that distribution has a positive impact on Deep Learning, since it accelerates the network's convergence for a given number of epochs. However, network traffic adds a significant amount of overhead, making the approach suitable mostly for very deep models or big data analytics.
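
For context, the sketch below illustrates how a training job of this kind is typically launched with TensorFlowOnSpark: Spark manages a cluster of executors, and each executor runs a TensorFlow training function in either a "ps" (parameter server) or "worker" role. This is a minimal, hedged example; the function name main_fun, the executor and parameter-server counts, and the input mode are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of launching distributed training with TensorFlowOnSpark.
# Executor counts, the training function body, and the app name are placeholders.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster


def main_fun(args, ctx):
    """Runs on every Spark executor; ctx identifies this node's TF role."""
    import tensorflow as tf
    # ctx.job_name is "ps" or "worker"; ctx.task_index is this node's index.
    # Build and train the model on the local data shard here (omitted).
    ...


if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("distributed-dl-sketch"))
    num_executors = 4   # number of training nodes (assumption)
    num_ps = 1          # number of parameter servers (assumption)

    # Reserve the executors, start TensorFlow on each, and run main_fun.
    cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                            tensorboard=False,
                            input_mode=TFCluster.InputMode.TENSORFLOW)
    cluster.shutdown()
    sc.stop()
```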
