Data Center Workload Monitoring, Analysis, and Emulation

Over the last ten years we have witnessed a shift from large mainframe computing to commodity, off-the-shelf clusters of servers. Today’s data centers contain thousands or tens of thousands of servers, providing services and computation for tens or hundreds of thousands of users. In addition to traditional IT challenges such as server management, security, and performance, data center owners must now deal with power and thermal issues, previously the domain of facilities management. These trends will continue to accelerate as organizations acquire bladed servers and consolidate multiple smaller clusters into centrally located data centers. In spite of these trends, however, there has been no corresponding change in emphasis in the methods and toolkits that target system instrumentation, analysis, management, replay, and emulation. This paper seeks to address this gap. We focus on methods and toolkits that enable the automated collection and analysis of workload traces from data centers, and that use those traces as the basis for repeatable and verifiable experiments and workload emulation. Our work has two components: (1) a location- and environment-aware extended knowledge plane that places thermal and power management concerns at the same level as service performance, collecting and analyzing facilities and performance data with particular focus on causal relationships across this boundary; and (2) data analysis and workload playback methods that allow detailed and flexible emulation of enterprise-class workloads. We discuss the high-level architectural requirements for these two components and present results from specific implementations and toolkits.
