Large-scale cluster management at Google with Borg

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
