Automating Cluster Management with Weave

Modern cluster management systems such as Kubernetes and OpenStack grapple with hard combinatorial optimization problems: load balancing, placement, scheduling, and configuration. Today, developers tackle these problems by designing custom, application-specific algorithms. This approach is proving unsustainable: ad-hoc solutions both perform poorly and introduce overwhelming complexity into the system, making it difficult to add important new features. We propose a radically different architecture in which programmers drive cluster management tasks declaratively, using SQL queries over cluster state stored in a relational database. These queries naturally capture both constraints on the cluster configuration and optimization objectives. When a cluster reconfiguration is required at runtime, our tool, Weave, synthesizes an encoding of these queries into an optimization model, which it solves using an off-the-shelf solver. We demonstrate Weave's efficacy by using it to power three production-grade systems: a Kubernetes scheduler, a virtual machine management solution, and a distributed transactional datastore. Using Weave, we expressed complex cluster management policies in under 20 lines of SQL, easily added new features to these existing systems, and significantly improved placement quality and convergence times.
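
To make the idea concrete, the following is a minimal sketch of how a capacity constraint and a load-balancing objective might be written as SQL over cluster state. It is not Weave's actual interface: the table names (`nodes`, `pods`, `placement`), the two queries, and the `solve` function are hypothetical, and a brute-force enumeration stands in for the off-the-shelf solver that the abstract describes.

```python
import itertools
import sqlite3

# Hypothetical schema for illustration only; not Weave's actual tables or API.
SCHEMA = """
CREATE TABLE nodes (name TEXT PRIMARY KEY, cpu_capacity INTEGER);
CREATE TABLE pods (name TEXT PRIMARY KEY, cpu_request INTEGER);
CREATE TABLE placement (pod TEXT, node TEXT);
"""

# Hard constraint: every row returned is a node whose capacity is exceeded.
CAPACITY_VIOLATIONS = """
SELECT n.name
FROM nodes n
JOIN placement p ON p.node = n.name
JOIN pods d ON d.name = p.pod
GROUP BY n.name, n.cpu_capacity
HAVING SUM(d.cpu_request) > n.cpu_capacity
"""

# Objective: the load on the busiest node, to be minimised.
MAX_LOAD = """
SELECT COALESCE(MAX(load), 0) FROM (
    SELECT SUM(d.cpu_request) AS load
    FROM placement p
    JOIN pods d ON d.name = p.pod
    GROUP BY p.node
)
"""


def solve(nodes, pods):
    """Return (cost, {pod: node}) for the best feasible placement, or None.

    Exhaustive search is an assumption made for brevity; it replaces the
    optimization model and solver and only works for tiny inputs.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", nodes.items())
    conn.executemany("INSERT INTO pods VALUES (?, ?)", pods.items())

    pod_names = sorted(pods)
    node_names = sorted(nodes)
    best = None
    for choice in itertools.product(node_names, repeat=len(pod_names)):
        # Materialise one candidate placement as ordinary rows.
        conn.execute("DELETE FROM placement")
        conn.executemany(
            "INSERT INTO placement VALUES (?, ?)", zip(pod_names, choice)
        )
        # Skip candidates for which the constraint query reports a violation.
        if conn.execute(CAPACITY_VIOLATIONS).fetchone() is not None:
            continue
        cost = conn.execute(MAX_LOAD).fetchone()[0]
        if best is None or cost < best[0]:
            best = (cost, dict(zip(pod_names, choice)))
    conn.close()
    return best
```

Calling `solve({"n1": 4, "n2": 4}, {"a": 3, "b": 2, "c": 1})` searches all eight candidate placements and keeps the feasible one with the lowest maximum node load. The point of the sketch is the division of labour: the policy lives entirely in the two SQL strings, so changing a constraint means editing a query rather than the search code.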
