Failure characterization and error detection in distributed web applications

Arshad, Fahad A. Ph.D., Purdue University, August 2014. Failure Characterization and Error Detection in Distributed Web Applications. Major Professor: Saurabh Bagchi. Enterprise-class distributed applications, such as web services that provide anything from critical infrastructure services to electronic commerce, have grown steadily in scale and complexity. With this evolution, it has become increasingly difficult to understand how these applications perform, when they fail, and what can be done to make them more resilient to failures caused by both hardware and software. Application developers tend to focus on bringing their applications to market quickly without testing the complex failure scenarios that can disrupt or degrade a given web service. Operators configure these web services without complete knowledge of how the configurations interact with the various layers. Matters are not helped by the ad hoc and often poor-quality failure logs generated by even mature and widely used software systems. Worse still, both end users and servers sometimes suffer from “silent problems,” where something goes wrong without any immediately obvious end-user manifestation. To address these reliability issues, it is crucial to characterize and detect software problems with some post-detection diagnostic context. This dissertation first presents a fault-injection and bug-repository-based evaluation to characterize silent and non-silent software failures and configuration problems in three-tier web applications and Java EE application servers. Second, for detection of software failures, we develop simple, low-cost application-generic and application-specific consistency checks, while for duplicate web requests (a class of performance problems), we develop a generic autocorrelation-based algorithm at the server end.
