Gridiron: A Technique for Augmenting Cloud Workloads with Network Bandwidth Requirements

Cloud applications consume more than just server resources; they also require networking resources. We propose a new technique for modeling the network bandwidth demand of networked cloud applications. Our technique, Gridiron, augments VM workload traces from the Azure cloud with network bandwidth requirements. The key to Gridiron is deriving inter-VM network bandwidth requirements using Amdahl’s second law. As a case study, we use Gridiron to generate realistic traces with network bandwidth demands for a distributed machine-learning training application. Workloads generated with Gridiron allow datacenter operators to estimate the network bandwidth demands of cloud applications and enable more realistic evaluation of cloud resource schedulers.
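As a rough illustration of the kind of derivation the abstract describes, the Python sketch below applies Amdahl's second law (a balanced system needs roughly 1 Mbit/s of I/O per MIPS of compute) to attach a bandwidth estimate to CPU-utilization trace rows. The MIPS-per-core figure, the trace schema, and all function names here are our own illustrative assumptions, not the paper's actual implementation.

# Minimal sketch: augment a VM CPU-utilization trace with a bandwidth
# estimate via Amdahl's second law (~1 Mbit/s of I/O per MIPS of compute).
# MIPS_PER_CORE and the trace schema are illustrative assumptions.

MIPS_PER_CORE = 10_000  # assumed per-core instruction rate (in MIPS)

def bandwidth_demand_mbps(cores: int, cpu_util: float) -> float:
    # Amdahl's second law: ~1 Mbit/s per MIPS, so the estimated bandwidth
    # in Mbit/s equals the effective MIPS the VM is consuming.
    return cores * cpu_util * MIPS_PER_CORE

def augment_trace(trace):
    # Attach a derived bandwidth column to each trace row.
    return [{**row, "bandwidth_mbps": bandwidth_demand_mbps(row["cores"], row["cpu_util"])}
            for row in trace]

# Hypothetical rows in the spirit of the Azure VM trace: VM id, provisioned
# cores, and average CPU utilization (the real trace schema differs).
trace = [{"vm_id": "vm-0", "cores": 4, "cpu_util": 0.35},
         {"vm_id": "vm-1", "cores": 8, "cpu_util": 0.80}]

for row in augment_trace(trace):
    print(row["vm_id"], f"{row['bandwidth_mbps']:.0f} Mbit/s")

Under these assumptions, a fully utilized core maps to a fixed per-core bandwidth budget, so the estimate scales linearly with effective compute; the actual Gridiron technique derives inter-VM requirements, which this single-VM sketch does not capture.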
