Performance troubleshooting in data centers: an annotated bibliography
Rajeev Gandhi | Priya Narasimhan | Karsten Schwan | Michael P. Kasick | Chengwei Wang | Liting Hu | Soila Kavulya | Jiaqi Tan | Mahendra Kutare
[1] George Candea,et al. Autonomous recovery in componentized Internet applications , 2006, Cluster Computing.
[2] George Candea,et al. Microreboot - A Technique for Cheap Recovery , 2004, OSDI.
[3] Julio César López-Hernández,et al. Stardust: tracking activity in a distributed storage system , 2006, SIGMETRICS '06/Performance '06.
[4] Kivanc M. Ozonat. An information-theoretic approach to detecting performance anomalies and changes for large-scale distributed web services , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[5] Paramvir Bahl,et al. Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM.
[6] Felix C. Gärtner,et al. Fundamentals of fault-tolerant distributed computing in asynchronous environments , 1999, CSUR.
[7] Rajeev Gandhi,et al. Black-Box Problem Diagnosis in Parallel File Systems , 2010, FAST.
[8] Marcos K. Aguilera,et al. Performance debugging for distributed systems of black boxes , 2003, SOSP '03.
[9] Justin Cappos,et al. San Fermín: Aggregating Large Data Sets Using a Binomial Swap Forest , 2008, NSDI.
[10] Matti A. Hiltunen,et al. Building Survivable Services Using Redundancy and Adaptation , 2003, IEEE Trans. Computers.
[11] Rachid Guerraoui,et al. Software-Based Replication for Fault Tolerance , 1997, Computer.
[12] Amin Vahdat,et al. Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.
[13] Jane Hillston,et al. Quality of service of crash-recovery failure detectors , 2007 .
[14] Donald Beaver,et al. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .
[15] David E. Culler,et al. The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput.
[16] Úlfar Erlingsson,et al. Fay: extensible distributed tracing from kernels to clusters , 2011, SOSP '11.
[17] Armando Fox,et al. Detecting application-level failures in component-based Internet services , 2005, IEEE Transactions on Neural Networks.
[18] Robbert van Renesse,et al. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining , 2003, TOCS.
[19] Ada Gavrilovska,et al. A practical approach for 'zero' downtime in an operational information system , 2002, Proceedings 22nd International Conference on Distributed Computing Systems.
[20] Albert G. Greenberg,et al. Scarlett: coping with skewed content popularity in mapreduce clusters , 2011, EuroSys '11.
[21] Daniel A. Reed,et al. Monitoring Large Systems Via Statistical Sampling , 2004, Int. J. High Perform. Comput. Appl.
[22] Rajeev Gandhi,et al. Ganesha: blackBox diagnosis of MapReduce systems , 2010, PERV.
[23] Andrea C. Arpaci-Dusseau,et al. FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.
[24] Sheng Ma,et al. Quickly Finding Known Software Problems via Automated Symptom Matching , 2005, Second International Conference on Autonomic Computing (ICAC'05).
[25] Paramvir Bahl,et al. Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.
[26] Gang Ren,et al. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers , 2010, IEEE Micro.
[27] Chun Zhang,et al. vPath: Precise Discovery of Request Processing Paths from Black-Box Observations of Thread and Network Activities , 2009, USENIX Annual Technical Conference.
[28] Xu Chen,et al. Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions , 2008, OSDI.
[29] Haifeng Chen,et al. PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems , 2010, ICAC '10.
[30] Wei-Ying Ma,et al. Automated known problem diagnosis with event traces , 2006, EuroSys.
[31] Chengwei Wang,et al. EbAT: online methods for detecting utility cloud anomalies , 2009, MDS '09.
[32] Darrell D. E. Long,et al. A longitudinal survey of Internet host reliability , 1995, Proceedings. 14th Symposium on Reliable Distributed Systems.
[33] Vanish Talwar,et al. A flexible architecture integrating monitoring and analytics for managing large-scale data centers , 2011, ICAC '11.
[34] Manish Marwah,et al. Enhanced server fault-tolerance for improved user experience , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[35] Eric A. Brewer,et al. Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.
[36] Sheng Ma,et al. Real-time problem determination in distributed systems using active probing , 2004, 2004 IEEE/IFIP Network Operations and Management Symposium (IEEE Cat. No.04CH37507).
[37] Alan L. Cox,et al. Whodunit: transactional profiling for multi-tier applications , 2007, EuroSys '07.
[38] Yue Zhang,et al. Toward automatic policy refinement in repair services for large distributed systems , 2010, OPSR.
[39] Abhishek Kumar,et al. Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems , 2008, OSDI.
[40] Christopher Stewart,et al. Performance modeling and system management for multi-component online services , 2005, NSDI.
[41] Anima Anandkumar,et al. Tracking in a spaghetti bowl: monitoring transactions using footprints , 2008, SIGMETRICS '08.
[42] Vanish Talwar,et al. Ranking anomalies in data centers , 2012, 2012 IEEE Network Operations and Management Symposium.
[43] Xiaozhou Li,et al. Efficient tracing and performance analysis for large distributed systems , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.
[44] Vanish Talwar,et al. Online detection of utility cloud anomalies using metric distributions , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.
[45] Yin Zhang,et al. Network Imprecision: a New Consistency Metric for Scalable Monitoring , 2008, OSDI.
[46] Kishor S. Trivedi,et al. Analysis and implementation of software rejuvenation in cluster systems , 2001, SIGMETRICS '01.
[47] Julie A. McCann,et al. A survey of autonomic computing—degrees, models, and applications , 2008, CSUR.
[48] Chun Yuan,et al. A Reinforcement Learning Approach to Automatic Error Recovery , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[49] Qiang Fu,et al. Mining Invariants from Console Logs for System Problem Detection , 2010, USENIX Annual Technical Conference.
[50] Xiaohui Gu,et al. On Predictability of System Anomalies in Real World , 2010, 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.
[51] Jeffrey P. Buzen,et al. MASF - Multivariate Adaptive Statistical Filtering , 1995, Int. CMG Conference.
[52] Charles E. McDowell,et al. Debugging concurrent programs , 1989, ACM Comput. Surv.
[53] Shicong Meng,et al. REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems , 2009, 2009 29th IEEE International Conference on Distributed Computing Systems.
[54] Aaron B. Brown,et al. An active approach to characterizing dynamic dependencies for problem determination in a distributed environment , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).
[55] Armando Fox,et al. Fingerprinting the datacenter: automated classification of performance crises , 2010, EuroSys '10.
[56] Haixun Wang,et al. Adaptive system anomaly prediction for large-scale hosting infrastructures , 2010, PODC.
[57] Vanish Talwar,et al. Monalytics: online monitoring and analytics for managing large scale data centers , 2010, ICAC '10.
[58] Karsten Schwan,et al. SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring , 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).
[59] Scott Smith,et al. Scaling a monitoring infrastructure for the Akamai network , 2010, OPSR.
[60] Gokul Soundararajan,et al. A query language and runtime tool for evaluating behavior of multi-tier servers , 2010, SIGMETRICS '10.
[61] Ashvin Goel,et al. Data recovery for web applications , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).
[62] Willy Zwaenepoel,et al. Dynamic content web applications: Crash, failover, and recovery analysis , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[63] Rajeev Gandhi,et al. Draco: Statistical diagnosis of chronic problems in large distributed systems , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[64] Miroslaw Malek,et al. Survey of software tools for evaluating reliability, availability, and serviceability , 1988, CSUR.
[65] Manoj K. Agarwal,et al. Correlating failures with asynchronous changes for root cause analysis in enterprise environments , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).
[66] Jay Lepreau,et al. Computer System Performance Problem Detection Using Time Series Model , 1993, USENIX Summer.
[67] KyoungSoo Park,et al. CoMon: a mostly-scalable monitoring system for PlanetLab , 2006, OPSR.
[68] Shicong Meng,et al. Monitoring continuous state violation in datacenters: Exploring the time dimension , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).
[69] Tao Yang,et al. Programming support and adaptive checkpointing for high-throughput data services with log-based recovery , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).
[70] Sam Shah,et al. Root cause detection in a service-oriented architecture , 2013, SIGMETRICS '13.
[71] Erez Zadok,et al. DARC: dynamic analysis of root causes of latency distributions , 2008, SIGMETRICS '08.
[72] Karsten Schwan,et al. E2EProf: Automated End-to-End Performance Management for Enterprise Systems , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[73] Zibin Zheng,et al. A QoS-aware fault tolerant middleware for dependable service composition , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[74] Dejan S. Milojicic,et al. Moara: Flexible and Scalable Group-Based Querying System , 2008, Middleware.
[75] Xin Li,et al. Reference-driven performance anomaly identification , 2009, SIGMETRICS '09.
[76] Vanish Talwar,et al. VScope: Middleware for Troubleshooting Time-Sensitive Data Center Applications , 2012, Middleware.
[77] Vanish Talwar,et al. Statistical techniques for online anomaly detection in data centers , 2011, 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops.
[78] Karsten Schwan,et al. iManage: Policy-Driven Self-management for Enterprise-Scale Systems , 2007, Middleware.
[79] Randy H. Katz,et al. X-Trace: A Pervasive Network Tracing Framework , 2007, NSDI.
[80] Hans-Peter Schwefel,et al. Performability Models for Multi-Server Systems with High-Variance Repair Durations , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[81] Albert G. Greenberg,et al. Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.
[82] Michael I. Jordan,et al. Detecting large-scale system problems by mining console logs , 2009, SOSP '09.
[83] Haifeng Chen,et al. Discovering likely invariants of distributed transaction systems for autonomic system management , 2006, 2006 IEEE International Conference on Autonomic Computing.
[84] Jeffrey S. Chase,et al. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control , 2004, OSDI.
[85] Karsten Schwan,et al. A state-space approach to SLA based management , 2008, NOMS 2008 - 2008 IEEE Network Operations and Management Symposium.
[86] Karsten Schwan,et al. Net-cohort: detecting and managing VM ensembles in virtualized data centers , 2012, ICAC '12.
[87] Haixun Wang,et al. Online Anomaly Prediction for Robust Cluster Systems , 2009, 2009 IEEE 25th International Conference on Data Engineering.
[88] Zhenhuan Gong,et al. Self-correlating predictive information tracking for large-scale production systems , 2009, ICAC '09.
[89] Thomas Reidemeister,et al. Automatic fault detection and diagnosis in complex software systems by information-theoretic monitoring , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[90] Evgenia Smirni,et al. Anomaly? application change? or workload change? towards automated detection of application performance anomaly and change , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).
[91] Mona Attariyan,et al. X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software , 2012, OSDI.
[92] Jeffrey O. Kephart. Autonomic computing: the first decade , 2011, ICAC '11.
[93] Gregory R. Ganger,et al. Diagnosing Performance Changes by Comparing Request Flows , 2011, NSDI.
[94] Haifeng Chen,et al. Ranking the importance of alerts for problem determination in large computer systems , 2009, ICAC '09.
[95] Xi Wang,et al. Intrusion Recovery Using Selective Re-execution , 2010, OSDI.
[96] Rajeev Gandhi,et al. Kahuna: Problem diagnosis for Mapreduce-based cloud computing environments , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.
[97] Richard Mortier,et al. Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.
[98] Xuezheng Liu,et al. D3S: Debugging Deployed Distributed Systems , 2008, NSDI.
[99] Vijay Mann,et al. Problem Determination in Enterprise Middleware Systems using Change Point Correlation of Time Series Data , 2006, 2006 IEEE/IFIP Network Operations and Management Symposium NOMS 2006.
[100] Armando Fox,et al. Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.
[101] Dan Meng,et al. Precise request tracing and performance debugging for multi-tier services of black boxes , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.
[102] Muli Ben-Yehuda,et al. NAP: a building block for remediating performance bottlenecks via black box network analysis , 2009, ICAC '09.
[103] Michael Dahlin,et al. A scalable distributed information management system , 2004, SIGCOMM.
[104] Shivnath Babu,et al. Guided Problem Diagnosis through Active Learning , 2008, 2008 International Conference on Autonomic Computing.
[105] Thomas Reidemeister,et al. System monitoring with metric-correlation models: problems and solutions , 2009, ICAC '09.
[106] Vanish Talwar,et al. vManage: loosely coupled platform and virtualization management in data centers , 2009, ICAC '09.
[107] Michael I. Jordan,et al. Failure diagnosis using decision trees , 2004 .