Kraken: Online and Elastic Resource Reservations for Cloud Datacenters

In cloud environments, the absence of strict network performance guarantees leads to unpredictable job execution times. Several recent proposals address this issue by providing guaranteed network performance. These proposals, however, rely on computing resource reservation schedules a priori. This is impractical in today’s cloud environments, where application demands are inherently unpredictable, e.g., due to differences in the input data sets or to phenomena such as failures and stragglers. To overcome these limitations, we designed Kraken, a system that allows minimum guarantees for both network bandwidth and compute resources to be updated dynamically at runtime. Unlike previous work, Kraken does not require prior knowledge of the applications' resource needs; instead, reservations can be modified while the application runs. Kraken achieves this through an online resource reservation scheme that comes with provable optimality guarantees. In this paper, we motivate the need for dynamic resource reservation schemes, present how Kraken provides them, and evaluate Kraken via extensive simulations and a preliminary Hadoop prototype.
