Performance troubleshooting in data centers: an annotated bibliography

In the emerging cloud computing era, enterprise data centers host a plethora of web services and applications, including those for e-commerce, distributed multimedia, and social networks, which jointly serve many aspects of our daily lives and business. For such applications, a lack of availability, reliability, or responsiveness can lead to extensive losses. For instance, on June 29, 2010, Amazon.com experienced three hours of intermittent performance problems: the normally reliable website took minutes to load items, searches came back without product links, and customers were unable to place orders. Based on its 2010 quarterly revenues, such downtime could cost Amazon up to $1.75 million per hour, making rapid problem resolution critical to its business. In another serious incident, on July 7, 2010, DBS Bank in Singapore suffered a 7-hour outage that crippled its Internet banking systems and disrupted other consumer banking services, including automated teller machines, credit card payments, and NETS payments. The cascading failure was caused by a procedural error during the replacement of a faulty component in one of the bank's storage systems that was connected to its main computers. The high cost of downtime in large-scale distributed systems drives the need for troubleshooting tools that can quickly detect problems and point system administrators to potential solutions. The increasing size and complexity of enterprise applications, coupled with the large scale of the data centers in which they operate, make troubleshooting extremely challenging. Problems can arise from a wide variety of root causes because of the complex interactions between hardware and software systems. The large volume of monitoring data available in these systems can obscure the root cause of these problems. Lastly, the multi-tier nature of applications composed of entirely different subsystems man-

[1]  George Candea,et al.  Autonomous recovery in componentized Internet applications , 2006, Cluster Computing.

[2]  George Candea,et al.  Microreboot - A Technique for Cheap Recovery , 2004, OSDI.

[3]  Julio César López-Hernández,et al.  Stardust: tracking activity in a distributed storage system , 2006, SIGMETRICS '06/Performance '06.

[4]  Kivanc M. Ozonat An information-theoretic approach to detecting performance anomalies and changes for large-scale distributed web services , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[5]  Paramvir Bahl,et al.  Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM.

[6]  Felix C. Gärtner,et al.  Fundamentals of fault-tolerant distributed computing in asynchronous environments , 1999, CSUR.

[7]  Rajeev Gandhi,et al.  Black-Box Problem Diagnosis in Parallel File Systems , 2010, FAST.

[8]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[9]  Justin Cappos,et al.  San Fermín: Aggregating Large Data Sets Using a Binomial Swap Forest , 2008, NSDI.

[10]  Matti A. Hiltunen,et al.  Building Survivable Services Using Redundancy and Adaptation , 2003, IEEE Trans. Computers.

[11]  Rachid Guerraoui,et al.  Software-Based Replication for Fault Tolerance , 1997, Computer.

[12]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.

[13]  Jane Hillston,et al.  Quality of service of crash-recovery failure detectors , 2007 .

[14]  Donald Beaver,et al.  Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[15]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..

[16]  Úlfar Erlingsson,et al.  Fay: extensible distributed tracing from kernels to clusters , 2011, SOSP '11.

[17]  Armando Fox,et al.  Detecting application-level failures in component-based Internet services , 2005, IEEE Transactions on Neural Networks.

[18]  Robbert van Renesse,et al.  Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining , 2003, TOCS.

[19]  Ada Gavrilovska,et al.  A practical approach for 'zero' downtime in an operational information system , 2002, Proceedings 22nd International Conference on Distributed Computing Systems.

[20]  Albert G. Greenberg,et al.  Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.

[21]  Daniel A. Reed,et al.  Monitoring Large Systems Via Statistical Sampling , 2004, Int. J. High Perform. Comput. Appl..

[22]  Rajeev Gandhi,et al.  Ganesha: blackBox diagnosis of MapReduce systems , 2010, PERV.

[23]  Andrea C. Arpaci-Dusseau,et al.  FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.

[24]  Sheng Ma,et al.  Quickly Finding Known Software Problems via Automated Symptom Matching , 2005, Second International Conference on Autonomic Computing (ICAC'05).

[25]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[26]  Gang Ren,et al.  Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers , 2010, IEEE Micro.

[27]  Chun Zhang,et al.  vPath: Precise Discovery of Request Processing Paths from Black-Box Observations of Thread and Network Activities , 2009, USENIX Annual Technical Conference.

[28]  Xu Chen,et al.  Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions , 2008, OSDI.

[29]  Haifeng Chen,et al.  PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems , 2010, ICAC '10.

[30]  Wei-Ying Ma,et al.  Automated known problem diagnosis with event traces , 2006, EuroSys.

[31]  Chengwei Wang,et al.  EbAT: online methods for detecting utility cloud anomalies , 2009, MDS '09.

[32]  Darrell D. E. Long,et al.  A longitudinal survey of Internet host reliability , 1995, Proceedings. 14th Symposium on Reliable Distributed Systems.

[33]  Vanish Talwar,et al.  A flexible architecture integrating monitoring and analytics for managing large-scale data centers , 2011, ICAC '11.

[34]  Manish Marwah,et al.  Enhanced server fault-tolerance for improved user experience , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[35]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[36]  Sheng Ma,et al.  Real-time problem determination in distributed systems using active probing , 2004, 2004 IEEE/IFIP Network Operations and Management Symposium (IEEE Cat. No.04CH37507).

[37]  Alan L. Cox,et al.  Whodunit: transactional profiling for multi-tier applications , 2007, EuroSys '07.

[38]  Yue Zhang,et al.  Toward automatic policy refinement in repair services for large distributed systems , 2010, OPSR.

[39]  Abhishek Kumar,et al.  Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems , 2008, OSDI.

[40]  Christopher Stewart,et al.  Performance modeling and system management for multi-component online services , 2005, NSDI.

[41]  Anima Anandkumar,et al.  Tracking in a spaghetti bowl: monitoring transactions using footprints , 2008, SIGMETRICS '08.

[42]  Vanish Talwar,et al.  Ranking anomalies in data centers , 2012, 2012 IEEE Network Operations and Management Symposium.

[43]  Xiaozhou Li,et al.  Efficient tracing and performance analysis for large distributed systems , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[44]  Vanish Talwar,et al.  Online detection of utility cloud anomalies using metric distributions , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.

[45]  Yin Zhang,et al.  Network Imprecision: A New Consistency Metric for Scalable Monitoring , 2008, OSDI.

[46]  Kishor S. Trivedi,et al.  Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.

[47]  Julie A. McCann,et al.  A survey of autonomic computing—degrees, models, and applications , 2008, CSUR.

[48]  Chun Yuan,et al.  A Reinforcement Learning Approach to Automatic Error Recovery , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[49]  Qiang Fu,et al.  Mining Invariants from Console Logs for System Problem Detection , 2010, USENIX Annual Technical Conference.

[50]  Xiaohui Gu,et al.  On Predictability of System Anomalies in Real World , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[51]  Jeffrey P. Buzen,et al.  MASF - Multivariate Adaptive Statistical Filtering , 1995, Int. CMG Conference.

[52]  Charles E. McDowell,et al.  Debugging concurrent programs , 1989, ACM Comput. Surv..

[53]  Shicong Meng,et al.  REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems , 2009, 2009 29th IEEE International Conference on Distributed Computing Systems.

[54]  Aaron B. Brown,et al.  An active approach to characterizing dynamic dependencies for problem determination in a distributed environment , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[55]  Armando Fox,et al.  Fingerprinting the datacenter: automated classification of performance crises , 2010, EuroSys '10.

[56]  Haixun Wang,et al.  Adaptive system anomaly prediction for large-scale hosting infrastructures , 2010, PODC.

[57]  Vanish Talwar,et al.  Monalytics: online monitoring and analytics for managing large scale data centers , 2010, ICAC '10.

[58]  Karsten Schwan,et al.  SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring , 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).

[59]  Scott Smith,et al.  Scaling a monitoring infrastructure for the Akamai network , 2010, OPSR.

[60]  Gokul Soundararajan,et al.  A query language and runtime tool for evaluating behavior of multi-tier servers , 2010, SIGMETRICS '10.

[61]  Ashvin Goel,et al.  Data recovery for web applications , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[62]  Willy Zwaenepoel,et al.  Dynamic content web applications: Crash, failover, and recovery analysis , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[63]  Rajeev Gandhi,et al.  Draco: Statistical diagnosis of chronic problems in large distributed systems , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[64]  Miroslaw Malek,et al.  Survey of software tools for evaluating reliability, availability, and serviceability , 1988, CSUR.

[65]  Manoj K. Agarwal,et al.  Correlating failures with asynchronous changes for root cause analysis in enterprise environments , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[66]  Jay Lepreau,et al.  Computer System Performance Problem Detection Using Time Series Model , 1993, USENIX Summer.

[67]  KyoungSoo Park,et al.  CoMon: a mostly-scalable monitoring system for PlanetLab , 2006, OPSR.

[68]  Shicong Meng,et al.  Monitoring continuous state violation in datacenters: Exploring the time dimension , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[69]  Tao Yang,et al.  Programming support and adaptive checkpointing for high-throughput data services with log-based recovery , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[70]  Sam Shah,et al.  Root cause detection in a service-oriented architecture , 2013, SIGMETRICS '13.

[71]  Erez Zadok,et al.  DARC: dynamic analysis of root causes of latency distributions , 2008, SIGMETRICS '08.

[72]  Karsten Schwan,et al.  E2EProf: Automated End-to-End Performance Management for Enterprise Systems , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[73]  Zibin Zheng,et al.  A QoS-aware fault tolerant middleware for dependable service composition , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[74]  Dejan S. Milojicic,et al.  Moara: Flexible and Scalable Group-Based Querying System , 2008, Middleware.

[75]  Xin Li,et al.  Reference-driven performance anomaly identification , 2009, SIGMETRICS '09.

[76]  Vanish Talwar,et al.  VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications , 2012, Middleware.

[77]  Vanish Talwar,et al.  Statistical techniques for online anomaly detection in data centers , 2011, 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops.

[78]  Karsten Schwan,et al.  iManage: Policy-Driven Self-management for Enterprise-Scale Systems , 2007, Middleware.

[79]  Randy H. Katz,et al.  X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.

[80]  Hans-Peter Schwefel,et al.  Performability Models for Multi-Server Systems with High-Variance Repair Durations , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[81]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[82]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[83]  Haifeng Chen,et al.  Discovering likely invariants of distributed transaction systems for autonomic system management , 2006, 2006 IEEE International Conference on Autonomic Computing.

[84]  Jeffrey S. Chase,et al.  Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.

[85]  Karsten Schwan,et al.  A state-space approach to SLA based management , 2008, NOMS 2008 - 2008 IEEE Network Operations and Management Symposium.

[86]  Karsten Schwan,et al.  Net-cohort: detecting and managing VM ensembles in virtualized data centers , 2012, ICAC '12.

[87]  Haixun Wang,et al.  Online Anomaly Prediction for Robust Cluster Systems , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[88]  Zhenhuan Gong,et al.  Self-correlating predictive information tracking for large-scale production systems , 2009, ICAC '09.

[89]  Thomas Reidemeister,et al.  Automatic fault detection and diagnosis in complex software systems by information-theoretic monitoring , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[90]  Evgenia Smirni,et al.  Anomaly? application change? or workload change? towards automated detection of application performance anomaly and change , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[91]  Mona Attariyan,et al.  X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software , 2012, OSDI.

[92]  Jeffrey O. Kephart Autonomic computing: the first decade , 2011, ICAC '11.

[93]  Gregory R. Ganger,et al.  Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.

[94]  Haifeng Chen,et al.  Ranking the importance of alerts for problem determination in large computer systems , 2009, ICAC '09.

[95]  Xi Wang,et al.  Intrusion Recovery Using Selective Re-execution , 2010, OSDI.

[96]  Rajeev Gandhi,et al.  Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.

[97]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[98]  Xuezheng Liu,et al.  D3S: Debugging Deployed Distributed Systems , 2008, NSDI.

[99]  Vijay Mann,et al.  Problem Determination in Enterprise Middleware Systems using Change Point Correlation of Time Series Data , 2006, 2006 IEEE/IFIP Network Operations and Management Symposium NOMS 2006.

[100]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[101]  Dan Meng,et al.  Precise request tracing and performance debugging for multi-tier services of black boxes , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[102]  Muli Ben-Yehuda,et al.  NAP: a building block for remediating performance bottlenecks via black box network analysis , 2009, ICAC '09.

[103]  Michael Dahlin,et al.  A scalable distributed information management system , 2004, SIGCOMM.

[104]  Shivnath Babu,et al.  Guided Problem Diagnosis through Active Learning , 2008, 2008 International Conference on Autonomic Computing.

[105]  Thomas Reidemeister,et al.  System monitoring with metric-correlation models: problems and solutions , 2009, ICAC '09.

[106]  Vanish Talwar,et al.  vManage: loosely coupled platform and virtualization management in data centers , 2009, ICAC '09.

[107]  Michael I. Jordan,et al.  Failure diagnosis using decision trees , 2004 .