Infrastructure-Aware TensorFlow for Heterogeneous Datacenters

Heterogeneous datacenters, which combine a variety of compute, memory, and network resources, are increasingly used to meet the resource requirements of time-sensitive applications. One such application framework is TensorFlow, which has become a platform of choice for running machine learning workloads. The state-of-the-art TensorFlow platform is oblivious to the availability and performance profiles of the underlying datacenter resources, and it does not take the resource requirements of a given workload into account during distributed training. As a result, training tasks are executed on busy and resource-constrained worker nodes, which significantly increases the overall training time. In this paper, we address this challenge by proposing architectural improvements and new software modules for the default TensorFlow platform that make it aware of the availability and capabilities of the underlying datacenter resources. The proposed Infrastructure-Aware TensorFlow schedules training tasks on the best possible resources for execution and thereby reduces the overall training time. Our evaluation, using worker nodes with varying availability and performance profiles, shows that the proposed enhancements reduce training time by up to 54% compared to the default TensorFlow platform.
