Semi-automated data center hotspot diagnosis

An increasingly important requirement for energy-efficient data center operation is to diagnose and fix thermal anomalies that sometimes occur due to excessive workload or equipment failures. Today, the task of diagnosing thermal anomalies entails expert but tedious analysis of data collected manually from disparate management systems. Our ultimate goal is to substantially reduce the time, tedium and expertise required to diagnose thermal hotspots by developing a system that generates accurate diagnoses automatically. We describe a substantial step towards this goal: a loosely-coupled, semi-automated thermal diagnosis system that integrates IT and facilities data, uses simple heuristics to highlight the most likely culprits, and provides a graphical interface that enables an administrator to narrow the list further by exploring data correlations. Among the challenges addressed by our solution are coping with heterogeneous data types and data access methods, and detecting and managing erroneous sensor readings.
