PAD: Performance Anomaly Detection in Multi-server Distributed Systems

Multi-server distributed systems have become increasingly popular with the rise of cloud computing. These systems must deliver high throughput at low latency, which is difficult to achieve. Manual performance tuning and diagnosis of such systems is hard, because the volume of relevant performance diagnosis data is large. To help system developers with performance diagnosis, we have developed a tool called Performance Anomaly Detector (PAD). PAD combines user-driven navigation analysis with automatic correlation and comparative analysis techniques, yielding a powerful tool that can uncover a range of performance anomalies. Based on our experience applying PAD to the Orleans system, we found that PAD reduced the developer time and effort needed to detect anomalous performance cases and improved developers' ability to analyze such behaviors in depth.
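To make the comparative-analysis idea concrete, the sketch below flags servers whose metric values deviate sharply from their peers using a robust (median/MAD) outlier test. This is an illustrative assumption about how such a comparison could work, not PAD's actual algorithm or API; the server metrics and threshold are hypothetical.

```python
# Hedged sketch of cross-server comparative analysis: flag servers whose
# per-server metric (e.g. mean request latency) deviates from the peer group,
# using a modified z-score based on the median and MAD for robustness.
# All numbers and names are illustrative, not taken from PAD.
from statistics import median

def robust_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    # Median absolute deviation; guard against a zero MAD for constant input.
    mad = median(abs(v - med) for v in values) or 1e-9
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical per-server mean latencies (ms); server index 3 is anomalous.
latencies = [12.1, 11.8, 12.4, 48.9, 12.0, 11.7]
print(robust_outliers(latencies))  # -> [3]
```

A real tool would run such a comparison per metric and per time window, then let the developer navigate from a flagged server down to the correlated metrics that explain the deviation.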
