Processing Diagnosis Queries: A Principled and Scalable Approach

Many popular Web sites suffer occasional user-visible problems such as slow responses, blank pages, displayed error messages, items not being added to shopping carts, and database slowdowns. Such deviations of a system from its desired behavior, referred to as failures, can cause user dissatisfaction and considerable loss of revenue. The scale, complexity, and dynamics of modern systems make it hard to track down the cause of failures manually. We address this problem through a new class of declarative queries, called diagnosis queries, that a system administrator or user can pose to pinpoint the cause of a failure. We describe how diagnosis queries are specified over system-monitoring data, and the challenges that current techniques face in processing these queries. We develop and evaluate a new algorithm, based on a combination of clustering and classification, to process diagnosis queries automatically, efficiently, and with good accuracy.