Virtualization drives higher resource utilization and makes provisioning new systems easy and cheap. This combination has led to an ever-increasing number of virtual machines: the largest data centers will likely have more than 100K in a few years, and many deployments will span multiple data centers. Virtual machines are also becoming increasingly capable, with more vCPUs, more memory, and higher-bandwidth virtual I/O devices that offer capabilities such as bandwidth throttling and traffic mirroring.
To reduce the work for IT administrators managing these environments, VMware and other companies provide several monitoring, automation, and policy-driven tools. These tools require a lot of information about various aspects of each VM and other objects in the system, such as physical hosts, storage infrastructure, and networking. To support these tools and the hundreds of simultaneous users who manage the environment, the management software needs to provide secure access to the data in real time with some degree of consistency and backward compatibility, and very high availability under a variety of failures and planned maintenance. Such software must satisfy a continuum of designs: it must perform well at large scale to accommodate the largest data centers, but it must also accommodate smaller deployments by limiting its resource consumption and overhead according to demand. The need for high-performance, robust management tools that scale from a few hosts to cloud scale poses interesting challenges for the management software. This paper presents some of the techniques we have employed to address these challenges.