Paper information - Large-scale cluster management at Google with Borg


Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
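To make "efficient task-packing" and "over-commitment" concrete, here is a minimal sketch of a two-step placement loop. It first checks which machines could host a task, then scores the feasible ones and picks the best. This is not Borg's scheduler. The names `Machine`, `Task`, `feasible`, `score`, `place` and `overcommit` are hypothetical. Best-fit scoring is just one simple policy chosen for illustration. Over-commitment is modeled, as an assumption, by a multiplier on CPU capacity only, with memory never over-committed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Machine:
    name: str
    cpu: float            # total CPU cores (hypothetical units)
    mem: float            # total memory in GiB (hypothetical units)
    used_cpu: float = 0.0
    used_mem: float = 0.0


@dataclass
class Task:
    name: str
    cpu: float
    mem: float


def feasible(m: Machine, t: Task, overcommit: float = 1.0) -> bool:
    """Feasibility check. CPU may be over-committed by `overcommit`
    (an assumed model); memory must fit within physical capacity."""
    return (m.used_cpu + t.cpu <= m.cpu * overcommit
            and m.used_mem + t.mem <= m.mem)


def score(m: Machine, t: Task) -> float:
    """Best-fit score: fraction of capacity left over after placing t.
    Lower is better, so tasks are packed onto the fullest machine."""
    return ((m.cpu - m.used_cpu - t.cpu) / m.cpu
            + (m.mem - m.used_mem - t.mem) / m.mem)


def place(tasks: List[Task], machines: List[Machine],
          overcommit: float = 1.0) -> Dict[str, Optional[str]]:
    """Place tasks in order. A task with no feasible machine stays pending (None)."""
    placement: Dict[str, Optional[str]] = {}
    for t in tasks:
        candidates = [m for m in machines if feasible(m, t, overcommit)]
        if not candidates:
            placement[t.name] = None
            continue
        best = min(candidates, key=lambda m: score(m, t))
        best.used_cpu += t.cpu
        best.used_mem += t.mem
        placement[t.name] = best.name
    return placement


if __name__ == "__main__":
    machines = [Machine("A", 4, 8), Machine("B", 8, 16)]
    tasks = [Task("t1", 3, 6), Task("t2", 2, 4), Task("t3", 4, 8)]
    print(place(tasks, machines))
```

Best fit packs tasks tightly, which tends to keep whole machines free for large tasks. A spreading (worst-fit) score would instead leave headroom on every machine for load spikes. Raising `overcommit` above 1.0 lets a task land on a machine whose CPU is already fully reserved. That improves utilization, at the cost of possible contention if every task uses its full request at once.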
